Mining online data for public
health surveillance
Vasileios Lampos (a.k.a. Bill)
Computer Science
University College London
@lampos
Structure
➡ Using online data for health applications
➡ From web searches to syndromic surveillance
i. Google Flu Trends: original failure and correction
ii. Better feature selection using semantic concepts
iii. Snapshot: Multi-task learning for disease models
Online data
22/03/2017
Online data for health (1/3)
When & why? — coverage — speed — cost — real-time
How? — collaborate with experts — access to user activity data — machine learning — natural language processing
Evaluation? — (partial) ground truth — model interpretation
Online data for health (2/3)
Google Flu Trends (discontinued)
Online data for health (2/3)
Flu Detector, fludetector.cs.ucl.ac.uk
Online data for health (2/3)
Health Map, healthmap.org
Online data for health (3/3)
Health
intervention
Impact?
Online data for health (3/3)
Vaccinations
against flu
Impact?
Lampos, Yom-Tov, Pebody, Cox (2015) doi:10.1007/s10618-015-0427-9
Wagner, Lampos, Yom-Tov, Pebody, Cox (2017) doi:10.2196/jmir.8184
Google Flu Trends (GFT)
[Figure 2 from Ginsberg et al., Nature 457, 19 February 2009: model estimates for the mid-Atlantic region (black) against CDC-reported ILI percentages (red), including points over which the model was fit and validated. A correlation of 0.85 was obtained over the 128 points to which the model was fit.]
GFT — Supervised learning
Regression
Observations (X): frequencies of n search queries for a location L over m contiguous time intervals of length τ
Targets (y): rates of influenza-like illness (ILI) for L over the same m time intervals, obtained from a health agency
Learn a function f such that f: X ∈ ℝⁿˣᵐ ⟶ y ∈ ℝᵐ
frequency of qᵢ = count of qᵢ / total count of all queries
GFT v.1 — Model
A linear model is fit using the log-odds of an ILI physician visit and the log-odds of an ILI-related search query:
logit(P) = β₀ + β₁ · logit(Q) + ε
— P: percentage (probability) of ILI-related physician visits
— Q: aggregate frequency of a set of search queries (the ILI-related query fraction)
— β₀: regression bias (intercept) term
— β₁: regression weight (one weight only)
— ε: independent, zero-centered noise
A single explanatory variable is used: the probability that a random search query submitted from the same region is ILI-related; the query set is determined by an automated method (see feature selection).
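The GFT v.1 model above can be sketched numerically. This is a minimal illustration with synthetic stand-in data (the real query fractions and CDC rates are not reproduced here), and the link coefficients 0.5 and 1.2 are arbitrary values chosen for the demonstration:

```python
import numpy as np

def logit(a):
    # logit(a) = log(a / (1 - a)), for a in (0, 1)
    return np.log(a / (1 - a))

# Synthetic stand-ins: Q is the aggregate ILI-related query fraction,
# P the ILI physician-visit probability, linked linearly in logit space.
rng = np.random.default_rng(0)
Q = rng.uniform(0.001, 0.03, size=128)
P = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * logit(Q))))  # inverse-logit of the linear model

# Fit logit(P) = b0 + b1 * logit(Q) by ordinary least squares.
A = np.column_stack([np.ones_like(Q), logit(Q)])
b0, b1 = np.linalg.lstsq(A, logit(P), rcond=None)[0]
```

Since the toy data are noiseless, the fit recovers the two coefficients exactly; with real data, ε absorbs the residual.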
GFT v.1 — “Logit”, why?
The logit function: logit(α) = log(α / (1 − α)), for α ∈ (0,1)
Exploratory analysis found that pairwise relationships between query rates and ILI rates were linear in logit space, motivating the use of this transformation across all experiments; related work followed the same modelling principle.
Values close to 0.5 are “squashed”, while border values (close to 0 or 1) are “emphasised”.
[Plots: (x, y) pairs vs. (x, logit(y)) pairs, with the z-scored logit(y) overlaid on y for comparison.]
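A minimal numeric illustration of the squashing and emphasising behaviour described above:

```python
import numpy as np

def logit(a):
    # logit(a) = log(a / (1 - a)), defined on (0, 1)
    return np.log(a / (1 - a))

probs = np.array([0.5, 0.55, 0.9, 0.99, 0.999])
print(logit(probs))
# Values near 0.5 map close to 0 ("squashed"); values near the borders
# 0 or 1 are pushed towards -inf/+inf ("emphasised").
```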
GFT v.1 — Data
9 US regions considered
50 million search queries (most frequent) geolocated in
these 9 US regions
Weekly ILI rates from CDC
170 weeks, 28/9/2003 to 11/5/2008 with ILI rate > 0
First 128 weeks: Training, 9 x 128 = 1,152 samples
Last 42 weeks: Testing (per region)
GFT v.1 — Feature selection (1/2)
1. Single-query flu models are trained for each US region
⇒ 50 million queries × 9 US regions = 450 million models
2. Inference accuracy is estimated for each query using linear (Pearson) correlation as the metric
3. Starting from the best-performing query and adding one query at a time, a new model is trained and evaluated
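The greedy procedure in steps 1 to 3 can be sketched as follows. This is a simplified illustration, not the full GFT pipeline: the "model" for a set of queries is just their mean frequency, scored by Pearson correlation against the target, and the function name is hypothetical:

```python
import numpy as np

def greedy_query_selection(X, y, max_queries=10):
    """Start from the single query whose series correlates best with y,
    then add one query at a time, keeping the aggregate whose estimate
    correlates best; stop when adding a query no longer helps."""
    remaining = set(range(X.shape[1]))
    selected, best_r = [], -np.inf
    while remaining and len(selected) < max_queries:
        scored = [
            (np.corrcoef(X[:, selected + [j]].mean(axis=1), y)[0, 1], j)
            for j in remaining
        ]
        r, j = max(scored)
        if r <= best_r:
            break
        best_r = r
        selected.append(j)
        remaining.discard(j)
    return selected, best_r

# Toy usage: column 0 is (almost) the target, the rest is noise.
rng = np.random.default_rng(0)
y = rng.normal(size=50)
X = rng.normal(size=(50, 5))
X[:, 0] = y + 0.01 * rng.normal(size=50)
selected, r = greedy_query_selection(X, y)
```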
GFT v.1 — Feature selection (1/2)
[Figure 1, Ginsberg et al.: an evaluation of how many top-scoring queries to include in the model; the mean correlation (y-axis, roughly 0.85 to 0.95) peaks at 45 queries.]
GFT v.1 — Feature selection (2/2)
Table 1 | Topics found in search queries which were found to be most correlated with CDC ILI data (Ginsberg et al.; "n / weighted" per column, weighted values for the next-55 column truncated in the source scan)
Search query topic | Top 45 queries | Next 55 queries
Influenza complication | 11 / 18.15 | 5 / 3.…
Cold/flu remedy | 8 / 5.05 | 6 / 5.…
General influenza symptoms | 5 / 2.60 | 1 / 0.…
Term for influenza | 4 / 3.74 | 6 / 0.…
Specific influenza symptom | 4 / 2.54 | 6 / 3.…
Symptoms of an influenza complication | 4 / 2.21 | 2 / 0.…
Antibiotic medication | 3 / 6.23 | 3 / 3.…
General influenza remedies | 2 / 0.18 | 1 / 0.…
Symptoms of a related disease | 2 / 1.66 | 2 / 0.…
Antiviral medication | 1 / 0.39 | 1 / 0.…
Related disease | 1 / 6.66 | 3 / 3.…
Unrelated to influenza | 0 / 0.00 | 19 / 28…
Total | 45 / 49.40 | 55 / 50…
The top 45 queries were used in the final model; the next 55 queries are presented for comparison purposes.
GFT v.1 — Performance (1/2)
Evaluated on 42 weeks (per region) from 2007–2008
Evaluation metric: Pearson correlation
μ(r) = .97 with min(r) = .92 and max(r) = .99
Performance looked great at the time, but this is not a proper performance evaluation! Why?
A potentially misleading metric (not the loss function used here) and a rather small testing time span (less than one flu season)
GFT v.1 — Performance (2/2)
Mid-Atlantic US region, Pearson correlation r = .96
[Figure 2, Ginsberg et al., Nature 457, 19 February 2009: model estimates (black) against CDC-reported ILI percentages (red), 2004–2008, including points over which the model was fit and validated. A correlation of 0.85 was obtained over the 128 points to which the model was fit, whereas a correlation of 0.96 was obtained over 42 validation points. Dotted lines indicate 95% prediction intervals. The region comprises New York, New Jersey and Pennsylvania.]
GFT v.2 — Data & evaluation
weekly frequencies of 49,708 search queries (US), filtered by a relaxed health-topic classifier; intersection of frequent queries across all US regions
from 4/1/2004 to 28/12/2013 (521 weeks)
corresponding weekly US ILI rates from CDC
test on 5 flu seasons, i.e. 5 year-long test sets (2008–13)
train on increasing data sets starting from 2004, using all data prior to a test period
GFT v.1 was simple to a (significant) fault
[Plot: CDC ILI rates vs. GFT estimates (US), weekly, 2009–2013.]
“rsv” — 25%
“flu symptoms” — 18%
“benzonatate” — 6%
“symptoms of pneumonia” — 6%
“upper respiratory infection” — 4%
GFT v.2 — Linear multivariate regression
Least squares
argmin_{w,β} Σᵢ₌₁ⁿ (xᵢw + β − yᵢ)²
Ridge regression
argmin_{w,β} ( Σᵢ₌₁ⁿ (xᵢw + β − yᵢ)² + λ Σⱼ₌₁ᵐ wⱼ² )
Elastic net
argmin_{w,β} ( Σᵢ₌₁ⁿ (xᵢw + β − yᵢ)² + λ₁ Σⱼ₌₁ᵐ |wⱼ| + λ₂ Σⱼ₌₁ᵐ wⱼ² )
Variables
— X ∈ ℝⁿˣᵐ: frequencies of m search queries for n weeks
— xᵢ ∈ ℝᵐ, i ∈ {1, …, n}: … for week i
— y ∈ ℝⁿ: ILI rates from CDC for n weeks
— yᵢ ∈ ℝ: … for week i
— w ∈ ℝᵐ: weights for the m search queries
— β ∈ ℝ: intercept term
— λ, λ₁, λ₂ ∈ ℝ₊: regularisation parameters
⚠ ☣ ⚠
Least squares regression is not applicable here
because we have very few training samples (n)
but many features (search queries; m).
Models derived from least squares will tend
to overfit the data, resulting in poor solutions.
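The overfitting warned about above is easy to demonstrate. With many more features than samples, the minimum-norm least-squares solution interpolates the training data exactly, even when the target is pure noise (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 200                 # few samples, many features
X = rng.normal(size=(n, m))
y = rng.normal(size=n)         # pure-noise target, unrelated to X

# Minimum-norm least-squares fit: with m >> n it drives the training
# residual to ~0, i.e. it "explains" noise perfectly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_err = np.max(np.abs(X @ w - y))
print(train_err)
```

Such a model has learned nothing generalisable, which is why GFT v.2 turns to regularisation.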
GFT v.2 — Regularisation with elastic net
Elastic net = least squares + L1- & L2-norm regularisers for the weights:
argmin_{w,β} ( Σᵢ₌₁ⁿ (xᵢw + β − yᵢ)² + λ₁ Σⱼ₌₁ᵐ |wⱼ| + λ₂ Σⱼ₌₁ᵐ wⱼ² ), with λ₁, λ₂ ∈ ℝ₊
— Encourages sparse models (feature selection): many weights will be set to zero!
— Handles collinear features (search queries)
— Number of selected features is not limited to the number of samples (n)
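A sketch of the sparsity property using scikit-learn's ElasticNet; note its (alpha, l1_ratio) parametrisation differs from the λ₁, λ₂ notation above, and the data here are synthetic with only 5 truly relevant "queries":

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
n, m = 100, 500                # more features (queries) than samples (weeks)
X = rng.normal(size=(n, m))
w_true = np.zeros(m)
w_true[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]   # only 5 features truly matter
y = X @ w_true + 0.05 * rng.normal(size=n)

# l1_ratio balances the L1 (sparsity) and L2 (grouping) penalties.
model = ElasticNet(alpha=0.1, l1_ratio=0.7, max_iter=10000)
model.fit(X, y)

n_selected = int(np.sum(model.coef_ != 0))
print(n_selected)   # far fewer than 500: most weights are exactly zero
```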
GFT v.2 — Feature selection
1st layer: keep search queries whose frequency time series has a Pearson correlation ≥ 0.5 with the CDC ILI rates (in the training data)
2nd layer: the elastic net assigns weights equal to 0 to features (search queries) identified as statistically irrelevant to the task
μ (σ) # queries selected across all training data sets:
all queries: 49,708 | r ≥ 0.5: 937 (334) | GFT: 46 (39) | Elastic net: 278 (64)
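The first-layer filter above can be sketched as follows. The function name is hypothetical, and keeping only positively correlated queries (rather than |r|) is an assumption based on the slide's wording:

```python
import numpy as np

def correlation_filter(X, y, threshold=0.5):
    """Keep queries (columns of X) whose frequency series has a Pearson
    correlation >= threshold with the ILI rate series y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return np.flatnonzero(r >= threshold)

# Toy usage: column 0 tracks the target closely, the rest is noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = rng.normal(size=(200, 10))
X[:, 0] = y + 0.1 * rng.normal(size=200)
keep = correlation_filter(X, y)
```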
GFT v.2 — Evaluation (1/2)
ILI rates (“… without a known cause other than influenza”) are published on a weekly basis, indicating the percentage of ILI prevalence at a national level; ILINet’s rates are used to train and evaluate the models throughout this work.
Performance metrics — given a set y = y₁, …, y_N of ground truth values and ŷ = ŷ₁, …, ŷ_N of predictions:
Pearson correlation: r = (1/(N−1)) Σₜ₌₁ᴺ ((yₜ − μ(y))/σ(y)) · ((ŷₜ − μ(ŷ))/σ(ŷ)), where μ(·) and σ(·) are the sample mean and standard deviation respectively
Mean Absolute Error: MAE(ŷ, y) = (1/N) Σₜ₌₁ᴺ |ŷₜ − yₜ|
Mean Absolute Percentage of Error: MAPE(ŷ, y) = (1/N) Σₜ₌₁ᴺ |ŷₜ − yₜ| / yₜ
Mean Squared Error: MSE(ŷ, y) = (1/N) Σₜ₌₁ᴺ (ŷₜ − yₜ)²
Note that performance is measured after reversing the logit transformation.
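The four metrics above translate directly into code; a minimal sketch with NumPy:

```python
import numpy as np

def pearson_r(y, yhat):
    # Sample Pearson correlation, matching the 1/(N-1) definition above.
    n = len(y)
    return np.sum(
        (y - y.mean()) / y.std(ddof=1) * (yhat - yhat.mean()) / yhat.std(ddof=1)
    ) / (n - 1)

def mae(y, yhat):
    return np.mean(np.abs(yhat - y))

def mape(y, yhat):
    # Assumes y > 0 (true for ILI rates).
    return np.mean(np.abs(yhat - y) / y)

def mse(y, yhat):
    return np.mean((yhat - y) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.1, 1.9, 3.2, 4.0])
```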
GFT v.2 — Evaluation (2/2)
[Plot: CDC ILI rates vs. GFT and elastic-net estimates (US), weekly, 2009–2013.]
GFT: r = .89, MAE = 3.81·10⁻³, MAPE = 20.4%
Elastic net: r = .92, MAE = 2.60·10⁻³, MAPE = 11.9%
GFT v.2 — Nonlinearities in the data
[Scatter plots of US ILI rates (CDC) against the frequency of single queries, each with a linear fit that misses the shape of the relationship:]
— ‘flu’
— ‘flu medicine’
— ‘how long is flu contagious’
— ‘how to break a fever’
— ‘sore throat treatment’
GFT v.2 — Gaussian Processes (1/4)
A Gaussian Process (GP) learns a distribution over functions that can explain the data
GPs can be defined as collections of random variables, any finite number of which have a multivariate Gaussian distribution
Fully specified by a mean function m and a covariance (kernel) function k; we set m(x) = 0 in our experiments
f(x) ∼ GP(m(x), k(x, x′)), with x, x′ ∈ ℝᵐ and f: ℝᵐ → ℝ
Inference: f* ∼ N(0, K), where (K)ᵢⱼ = k(xᵢ, xⱼ)
GFT v.2 — Gaussian Processes (2/4)
Common GP kernels (covariance functions) and the priors on functions they encode:
— Squared-exp (SE): k(x, x′) = σ_f² exp(−(x − x′)² / (2ℓ²)) → local variation
— Periodic (Per): k(x, x′) = σ_f² exp(−(2/ℓ²) sin²(π(x − x′)/p)) → repeating structure
— Linear (Lin): k(x, x′) = σ_f² (x − c)(x′ − c) → linear functions
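The three kernels above, and a draw from a zero-mean GP prior under the SE kernel, can be sketched for scalar inputs (all hyperparameter defaults here are illustrative):

```python
import numpy as np

def k_se(x, xp, sf=1.0, ell=1.0):
    # Squared-exponential kernel: local variation.
    return sf**2 * np.exp(-((x - xp) ** 2) / (2 * ell**2))

def k_per(x, xp, sf=1.0, ell=1.0, p=1.0):
    # Periodic kernel: repeating structure with period p.
    return sf**2 * np.exp(-2 * np.sin(np.pi * (x - xp) / p) ** 2 / ell**2)

def k_lin(x, xp, sf=1.0, c=0.0):
    # Linear kernel: linear functions through offset c.
    return sf**2 * (x - c) * (xp - c)

# Sample one function from the GP prior with the SE kernel.
x = np.linspace(0, 5, 50)
K = k_se(x[:, None], x[None, :])           # Gram matrix over the grid
rng = np.random.default_rng(0)
f = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))
```

A small jitter (1e-8) on the diagonal keeps the Gram matrix numerically positive definite for sampling.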
GFT v.2 — Gaussian Processes (3/4)
Adding or multiplying GP kernels produces a new valid GP kernel
Examples of one-dimensional structures expressible by multiplying kernels (Duvenaud, “Expressing Structure with Kernels”, Fig. 1.2):
— Lin × Lin: quadratic functions
— SE × Per: locally periodic
— Lin × SE: increasing variation
— Lin × Per: growing amplitude
GFT v.2 — Gaussian Processes (4/4)
(x, y) pairs with an obvious nonlinear relationship
[Scatter plot: x (predictor variable) vs. y (target variable)]
GFT v.2 — Gaussian Processes (4/4)
least squares regression (a poor solution)
[Plot: x, y pairs with the OLS fit (train and test)]
GFT v.2 — Gaussian Processes (4/4)
sum of 2 GP kernels (periodic + squared exponential)
[Plot: x, y pairs with the OLS and GP fits (train and test)]
Clustering queries selected by elastic net into C clusters
with k-means
Clusters are determined by using cosine similarity as
the distance metric (on query frequency time series)
Groups queries with similar topicality & usage patterns
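The clustering step can be sketched as a minimal spherical k-means: L2-normalising the query frequency time series makes the centroid assignment equivalent to picking the centre with the highest cosine similarity. The random matrix below is a toy stand-in for real query frequencies:

```python
import numpy as np

def cosine_kmeans(X, C, iters=50, seed=0):
    """Minimal k-means with cosine similarity: L2-normalise rows, then run
    standard k-means on the unit sphere (spherical k-means)."""
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    centres = Xn[rng.choice(len(Xn), C, replace=False)]
    for _ in range(iters):
        # assign each query to the centre with the highest cosine similarity
        labels = (Xn @ centres.T).argmax(axis=1)
        for c in range(C):
            members = Xn[labels == c]
            if len(members):
                m = members.mean(axis=0)
                centres[c] = m / np.linalg.norm(m)
    return labels

rng = np.random.default_rng(1)
X = rng.random((100, 52))        # toy query-frequency matrix (queries x weeks)
labels = cosine_kmeans(X, C=10)  # 10 clusters, as in the GP kernel below
```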
GFT v.2 — k-means and GP regression
GPs can be defined as sets of random variables, any finite number of which have a multivariate Gaussian distribution. In GP regression, for the inputs x, x′ ∈ R^M (both expressing rows of the observation matrix X) we want to learn a function f: R^M → R drawn from a GP prior

f(x) ~ GP( µ(x) = 0, k(x, x′) )

with a squared exponential covariance applied per query cluster,

k_SE(x, x′) = σ² exp( −‖x − x′‖² / (2ℓ²) )

k(x, x′) = ( Σ_{i=1}^C k_SE(x_ci, x′_ci) )  [clusters]  +  σ² · δ(x, x′)  [noise]

x = {x_c1, …, x_c10}
GFT v.2 — Performance

[Plot: weekly ILI rates (US), 2009–2013 — CDC ILI rates vs. GP and Elastic Net estimates]

Elastic net: r = .92, MAE = 2.60·10⁻³, MAPE = 11.9%
GP: r = .95, MAE = 2.21·10⁻³, MAPE = 10.8%
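The metrics quoted throughout the deck (r, MAE, MAPE) can be computed as follows; `evaluate` is an illustrative helper, and MAPE assumes strictly positive ground-truth rates:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Correlation (r), mean absolute error (MAE) and mean absolute
    percentage error (MAPE) between ground-truth and estimated ILI rates."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    r = np.corrcoef(y_true, y_pred)[0, 1]
    mae = np.abs(y_true - y_pred).mean()
    mape = np.abs((y_true - y_pred) / y_true).mean() * 100  # needs y_true > 0
    return r, mae, mape

r, mae, mape = evaluate([20.0, 30.0, 25.0], [22.0, 27.0, 25.0])
```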
GFT v.2 — Queries’ added value
ARMAX model

An exogenous regression component in the ARMAX function incorporates further information into the model. In all of the experiments the season is fixed to 52 weeks (1 year long). The full model is

y_t = Σ_{i=1}^p φ_i y_{t−i} + Σ_{i=1}^J ω_i y_{t−52−i}   [AR and seasonal AR]
    + Σ_{i=1}^q θ_i ε_{t−i} + Σ_{i=1}^K ν_i ε_{t−52−i}   [MA and seasonal MA]
    + Σ_{i=1}^D w_i h_{t,i}   [regression]
    + ε_t

where ω_i and ν_i are lagged variable parameters of orders J and K, respectively. We estimate a series of models and keep the one that minimises the Akaike Information Criterion (AIC); all the hyper-parameters are selected this way. The evidence of seasonality in the signal is far less clear in the first prediction periods (where there are fewer samples from previous years) than in the last prediction period (preceded by many seasons): the first few prediction periods do not incorporate a yearly lag, while later ones tend to incorporate an AR seasonal lag of order 1 and a moving average seasonal lag. Furthermore, the estimation procedure includes a search over integrated components, augmenting the model to an autoregressive integrated moving average regression (ARIMA); the integrated part of the model refers to differencing for removing non-stationarities present in the time series.
Autoregression: Combine CDC ILI rates from the
previous week(s) with the ILI rate estimate from search
queries for the current week
Various week lags explored (1, 2,…, 6 weeks)
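The autoregressive setup above can be sketched as building a lagged design matrix: CDC ILI rates delayed by the chosen lag paired with the current-week search-based estimate. `ar_design` and the toy series are illustrative, not the exact pipeline:

```python
import numpy as np

def ar_design(cdc, search_est, lag=1):
    """Design matrix for the autoregressive model: CDC ILI rates delayed by
    `lag` weeks plus the search-based estimate for the current week.
    Targets are the current-week CDC rates."""
    cdc = np.asarray(cdc, float)
    search_est = np.asarray(search_est, float)
    X = np.column_stack([cdc[:-lag], search_est[lag:]])
    y = cdc[lag:]
    return X, y

cdc = np.arange(10.0)        # toy weekly CDC ILI rates
est = np.arange(10.0) + 0.5  # toy search-query ILI estimates
X, y = ar_design(cdc, est, lag=2)
```

A linear model fit on (X, y) then combines both signals, as in the AR(CDC, GP) results below.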
GFT v.2 — Performance

1-week lag for the CDC data
AR(CDC): r = .97, MAE = 1.87·10⁻³, MAPE = 8.2%
AR(CDC, GP): r = .99, MAE = 1.05·10⁻³, MAPE = 5.7%
—
2-week lag for the CDC data
AR(CDC): r = .87, MAE = 3.36·10⁻³, MAPE = 14.3%
AR(CDC, GP): r = .99, MAE = 1.35·10⁻³, MAPE = 7.3%
—
GP: r = .95, MAE = 2.21·10⁻³, MAPE = 10.8%
GFT v.2 — Non-optimal feature selection

Queries irrelevant to flu are still retained, e.g. “nba injury report” or “muscle building supplements”

Feature selection is primarily based on correlation, and then on a linear relationship

Introduce a semantic feature selection to
— enhance causal connections (implicitly)
— circumvent the painful training of a classifier
GFT v.3 — Word embeddings
Word embeddings are vectors of a certain dimensionality
(usually from 50 to 1024) that represent words in a corpus
Derive these vectors by predicting contextual word
occurrence in large corpora (word2vec) using a shallow
neural network approach:
— Continuous Bag-Of-Words (CBOW): Predict the centre word from the surrounding ones
— skip-gram: Predict the surrounding words from the centre one
Other methods available: GloVe, fastText
GFT v.3 — Word embedding data sets

Use tweets geolocated in the UK to learn word embeddings that may capture
— informal language used in searches
— British English language / expressions
— cultural biases

(a) 215 million tweets (February 2014 to March 2016), CBOW, 512 dimensions, 137,421 words covered
https://doi.org/10.6084/m9.figshare.4052331.v1

(b) 1.1 billion tweets (2012 to 2016), skip-gram, 512 dimensions, 470,194 words covered
https://doi.org/10.6084/m9.figshare.5791650.v1
GFT v.3 — Cosine similarity

cos(v, u) = (v · u) / (‖v‖ ‖u‖) = Σ_{i=1}^n v_i u_i / ( √(Σ_{i=1}^n v_i²) · √(Σ_{i=1}^n u_i²) )

Word embedding analogy via an additive objective:

max_v ( cos(v, ‘king’) + cos(v, ‘woman’) − cos(v, ‘man’) )  ⇒  v = ‘queen’

or via a multiplicative objective:

max_v ( cos̄(v, ‘king’) × cos̄(v, ‘woman’) / cos̄(v, ‘man’) )  ⇒  v = ‘queen’,
where cos̄(·, ·) = (cos(·, ·) + 1) / 2

‘king’ and ‘woman’ form the positive context; ‘man’ forms the negative context.
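The cosine similarity and the additive analogy objective above can be sketched as follows; the 2-D embeddings are toy values for illustration (the real vectors are 512-dimensional, learned from tweets):

```python
import numpy as np

def cos(v, u):
    """Cosine similarity between two embedding vectors."""
    return float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))

# Toy 2-D "embeddings", purely illustrative.
emb = {
    'king':  np.array([1.0, 1.0]),
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([0.0, 1.0]),
    'queen': np.array([0.1, 1.2]),
    'apple': np.array([-1.0, 0.3]),
}

def analogy(pos, neg, exclude):
    """argmax over the vocabulary of the additive analogy objective:
    sum of similarities to positive words minus similarities to negatives."""
    def score(w):
        v = emb[w]
        return (sum(cos(v, emb[p]) for p in pos)
                - sum(cos(v, emb[n]) for n in neg))
    return max((w for w in emb if w not in exclude), key=score)

best = analogy(pos=['king', 'woman'], neg=['man'],
               exclude={'king', 'woman', 'man'})
```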
GFT v.3 — Analogies in Twitter embeddings

The … for … not the … is … ?
woman      king            man        →  queen
him        she             he         →  her
better     bad             good       →  worse
England    Rome            London     →  Italy
Messi      basketball      football   →  Lebron
Guardian   Conservatives   Labour     →  Telegraph
Trump      Europe          USA        →  Farage
rsv        fever           skin       →  flu
GFT v.3 — Better query selection (1/3)

1. Query embedding = average token embedding
2. Derive a concept by specifying a positive (P) and a negative (N) context (sets of n-grams)
3. Rank all queries using their similarity score with this concept

Example concept contexts:
— tummy ache, nausea, feeling nausea, nausea and vomiting, bloated tummy, dull stomach ache, heartburn, feeling bloated, aches, belly ache, stomach ache, feeling sleepy, spasms, stomach aches, stomach ache after eating, ache, feeling nauseous, headache and nausea
— flu epidemic, flu, dispensary, hospital, sanatorium, fever, flu outbreak, epidemic, flu medicine, doctors hospital, flu treatment, influenza flu, flu pandemic, gp surgery, clinic, flu vaccine, flu shot, infirmary, hospice, tuberculosis, physician, flu vaccination

Similarity function:

S(Q, C) = Σ_{i=1}^k cos(e_Q, e_{P_i}) / ( Σ_{j=1}^z cos(e_Q, e_{N_j}) + γ )   (7)

The numerator and denominator of Eq. 7 are sums of cosine similarities between the embedding of the search query (e_Q) and the embeddings of the positive and negative concept n-grams (e_{P_i}, e_{N_j}); γ is a constant to avoid division by 0.
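A sketch of the concept score of Eq. 7. Here the cosine is shifted to [0, 1] (the cos̄ defined earlier) so that both sums stay positive; whether the raw or shifted cosine is used, the value of γ, and the toy embeddings are all assumptions of this sketch:

```python
import numpy as np

def cos_bar(v, u):
    """Cosine similarity shifted to [0, 1], i.e. (cos + 1) / 2."""
    c = v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
    return (c + 1) / 2

def embed_query(query, emb):
    """Query embedding = average embedding of the query's tokens."""
    return np.mean([emb[t] for t in query.split()], axis=0)

def concept_score(query, positive, negative, emb, gamma=0.001):
    """S(Q, C): summed similarity to the positive context n-grams over
    summed similarity to the negative ones, plus gamma to avoid division
    by zero (gamma's value here is an assumption)."""
    e_q = embed_query(query, emb)
    num = sum(cos_bar(e_q, embed_query(p, emb)) for p in positive)
    den = sum(cos_bar(e_q, embed_query(n, emb)) for n in negative) + gamma
    return num / den

# toy 2-D embeddings, purely illustrative
emb = {'flu': np.array([1.0, 0.2]), 'fever': np.array([0.9, 0.3]),
       'bieber': np.array([-0.8, 0.9]), 'cold': np.array([0.8, 0.1]),
       'remedies': np.array([0.7, 0.4])}

s = concept_score('cold remedies', ['flu', 'fever'], ['bieber'], emb)
```

Ranking all candidate queries by this score yields lists like those on the next slide.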
GFT v.3 — Better query selection (2/3)

Positive context: #flu, fever, flu, flu medicine, gp, hospital
Negative context: bieber, ebola, wikipedia
Most similar queries: cold flu medicine, flu aches, cold and flu, cold flu symptoms, colds and flu

Positive context: flu, flu gp, flu hospital, flu medicine
Negative context: ebola, wikipedia
Most similar queries: flu aches, flu, colds and flu, cold and flu, cold flu medicine
GFT v.3 — Better query selection (3/3)
[Figure 2: Histogram presenting the distribution of the search query similarity scores (S; see Eq. 7) with the flu infection concept C1]

[Table 2: number of search queries |Q| selected at thresholds S > µ_S + θσ_S]

Given that the distribution of concept similarity scores appears to be unimodal, we use standard deviations from the mean (µ_S + θσ_S) to determine the selected queries; the threshold θ can be decided, perhaps with the assistance of an expert.
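Selecting queries at θ standard deviations above the mean score can be sketched as:

```python
import numpy as np

def select_queries(scores, theta):
    """Indices of queries whose concept similarity score exceeds
    mean + theta * standard deviation of the score distribution."""
    scores = np.asarray(scores, float)
    threshold = scores.mean() + theta * scores.std()
    return np.flatnonzero(scores > threshold)
```

Raising θ shrinks the selected set, trading recall for topical precision.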
GFT v.3 — Hybrid feature selection

Embedding-based feature selection is an unsupervised technique, and thus not optimal on its own

If we combine it with the previous ways of selecting features, will we obtain better inference accuracy?

We test 7 feature selection approaches:
similarity → elastic net (1)
correlation → elastic net (2) → GP (3)
similarity → correlation → elastic net (4) → GP (5)
similarity → correlation → GP (6)
correlation → GP (7)
By setting µ(x) = 0, a common practice in GP modelling, we focus only on the kernel function. We use the Matérn covariance function [42] to handle abrupt changes in the predictors, given that the experiments are based on a sample of the original Google search data. It is defined as

k_M^(ν)(x, x′) = σ² (2^(1−ν) / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ)   (3)

where K_ν is a modified Bessel function, ν is a positive constant, ℓ is the lengthscale parameter, σ² a scaling factor (variance), and r = x − x′. We also use a Squared Exponential (SE) covariance to capture smoother trends in the data, defined by

k_SE(x, x′) = σ² exp( −r² / (2ℓ²) )   (4)

We have chosen to combine these kernels through a summation. Note that the summation of GP kernels results in a new valid GP kernel [54]. An additive kernel allows modelling with a sum of independent functions, where each one can potentially account for a different type of structure in the data [18]. We are using two Matérn functions (ν = 3/2), in an attempt to model long- as well as medium- (or short-) term irregularities, an SE kernel, and white noise. Thus, the final kernel is given by

k(x, x′) = Σ_{i=1}^2 k_M^(ν=3/2)(x, x′; σ_i, ℓ_i) + k_SE(x, x′; σ_3, ℓ_3) + σ_4² δ(x, x′)   (5)

where δ is a Kronecker delta function. Parameters (7 in total) are optimised using the Laplace approximation.
GFT v.3 — GP model details
Skipped in the interest of time!
If you’re interested,
check Section 3.1 of
https://doi.org/10.1145/3038912.3052622
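For ν = 3/2 the Matérn kernel has a simple closed form (no Bessel function needed), which makes the additive kernel of Eq. 5 easy to sketch; the parameter packing in `k_full` is illustrative:

```python
import numpy as np

def k_matern32(x, x2, sf=1.0, ell=1.0):
    """Matérn covariance for nu = 3/2, closed form:
    sf^2 (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell)."""
    r = np.abs(x - x2)
    a = np.sqrt(3.0) * r / ell
    return sf**2 * (1 + a) * np.exp(-a)

def k_se(x, x2, sf=1.0, ell=1.0):
    """Squared Exponential covariance (Eq. 4)."""
    return sf**2 * np.exp(-(x - x2)**2 / (2 * ell**2))

def k_full(x, x2, params):
    """Additive kernel of Eq. 5: two Matérn-3/2 terms (long and medium
    term), an SE term, and white noise via a Kronecker delta."""
    (s1, l1), (s2, l2), (s3, l3), s4 = params
    noise = s4**2 * (x == x2)
    return (k_matern32(x, x2, s1, l1) + k_matern32(x, x2, s2, l2)
            + k_se(x, x2, s3, l3) + noise)

# illustrative hyper-parameters: (sigma_1, ell_1), ..., sigma_4
params = ((1.0, 1.0), (1.0, 1.0), (1.0, 1.0), 0.5)
value = k_full(0.0, 0.0, params)
```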
GFT v.3 — Data & evaluation
weekly frequency of 35,572 search queries (UK)
from 1/1/2007 to 9/08/2015 (449 weeks)
access to a private Google Health Trends API for health-
oriented research
corresponding ILI rates for England (Royal College of
General Practitioners and Public Health England)
test on the last 3 flu seasons in the data (2012-2015)
train on increasing data sets starting from 2007,
using all data prior to a test period
GFT v.3 — Performance (1/3)

(a) similarity → elastic net
(b) correlation → elastic net
(c) similarity → correlation → elastic net

(a): r = 0.87, MAE × 0.1 = 0.30, MAPE = 61.05%
(b): r = 0.88, MAE × 0.1 = 0.21, MAPE = 47.15%
(c): r = 0.91, MAE × 0.1 = 0.19, MAPE = 36.23%
GFT v.3 — Performance (2/3)

Elastic net with and without word-embedding filtering.

[Figure 3: ILI rate per 100,000 people, 2013–2015 — comparative plot of the optimal models for the correlation-based and hybrid feature selection under Elastic Net for the estimation of ILI rates in England (RCGP/PHE ILI rates denote the ground truth)]

Table 4 shows a few characteristic examples of potentially misleading queries that are filtered out by the hybrid feature selection approach, while previously having received a nonzero regression weight. Evidently, there exist several queries irrelevant to the target theme, referring to specific individuals and related activities, different health problems or seasonal topics. The regression weight that these queries receive tends to constitute a significant proportion of the highest weight, providing an indirect indication of the inappropriateness of the selected features.

Examples of filtered queries (ratio over the highest regression weight): prof. surname (70.3%), name surname (27.2%), heal the world (21.9%), heating oil (21.2%), name surname recipes (21%), tlc diet (13.3%), blood game (12.3%), swine flu vaccine side effects (7.2%)

Figure 4 draws a comparison between the inferences of the best nonlinear and linear models, both of which happen to use the same feature basis (r > .30 ∩ S > µ_S + σ_S). The GP model provides smoother estimates and an overall better balance between stronger and milder flu seasons.
GFT v.3 — Performance (3/3)

(a) correlation → GP
(b) correlation → elastic net → GP
(c) similarity → correlation → elastic net → GP
(d) similarity → correlation → GP

(a): r = 0.89, MAE × 0.1 = 0.22, MAPE = 34.17%
(b): r = 0.92, MAE × 0.1 = 0.23, MAPE = 35.88%
(c): r = 0.93, MAE × 0.1 = 0.17, MAPE = 30.30%
(d): r = 0.94, MAE × 0.1 = 0.16, MAPE = 25.81%
Multi-task learning
m tasks (problems) t1,…,tm
observations Xt1,yt1,…,Xtm,ytm
learn models fti: Xti → yti jointly (and not independently)
Why?
When tasks are related, multi-task learning is expected to
perform better than learning each task independently
Model learning possible even with a few training
samples
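As a toy illustration of why joint learning helps, here is mean-regularised multi-task ridge regression, a simple stand-in for (not the same as) the multi-task GP used in this work: each task's weights are a shared w0 plus a lightly-used per-task deviation, so tasks with few samples borrow statistical strength from the rest.

```python
import numpy as np

def multitask_ridge(Xs, ys, lam_shared=0.1, lam_task=1.0):
    """Mean-regularised multi-task ridge: task weights are w0 + v_t, with
    the per-task deviations v_t penalised more heavily than the shared w0,
    so related tasks pull each other towards a common solution."""
    m = len(Xs)                 # number of tasks
    d = Xs[0].shape[1]          # number of features
    n = sum(len(X) for X in Xs)
    # augmented design: a shared block plus one block per task
    A = np.zeros((n, d * (m + 1)))
    y = np.concatenate(ys)
    row = 0
    for t, X in enumerate(Xs):
        A[row:row + len(X), :d] = X                       # shared block (w0)
        A[row:row + len(X), d * (t + 1):d * (t + 2)] = X  # deviation (v_t)
        row += len(X)
    reg = np.concatenate([np.full(d, lam_shared), np.full(d * m, lam_task)])
    W = np.linalg.solve(A.T @ A + np.diag(reg), A.T @ y)
    w0, V = W[:d], W[d:].reshape(m, d)
    return [w0 + V[t] for t in range(m)]

# two identical toy tasks with target y = 2x: both should recover w ~ 2
X = np.array([[1.0], [2.0], [3.0]])
ws = multitask_ridge([X, X], [np.array([2.0, 4.0, 6.0])] * 2)
```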
Multi-task learning for disease modelling
m tasks (problems) t1,…,tm
observations Xt1,yt1,…,Xtm,ytm
learn models fti: Xti → yti jointly (and not independently)
Can we improve disease models (flu) from online search:
when sporadic training data are available?
across the geographical regions of a country?
across two different countries?
Multi-task learning GFT (1/5)
[Paper snapshot — Department of Computer Science, University College London, United Kingdom; v.lampos@ucl.ac.uk, i.cox@ucl.ac.uk]

[Figure 1: The 10 US regions as specified by the Department of Health & Human Services (HHS)]
Can multi-task learning across the 10 US regions help
us improve the national ILI model?
5 years of training data:
Elastic Net: r = 0.96, MAE = 0.35
MT Elastic Net: r = 0.96, MAE = 0.35
GP: r = 0.97, MAE = 0.25
MT GP: r = 0.97, MAE = 0.25

1 year of training data:
Elastic Net: r = 0.85, MAE = 0.53
MT Elastic Net: r = 0.87, MAE = 0.46
GP: r = 0.85, MAE = 0.51
MT GP: r = 0.88, MAE = 0.44

Multi-task learning GFT (2/5)

Can multi-task learning across the 10 US regions help us improve the regional ILI models?

1 year of training data:
Elastic Net: r = 0.85, MAE = 0.53
MT Elastic Net: r = 0.86, MAE = 0.49
GP: r = 0.84, MAE = 0.54
MT GP: r = 0.87, MAE = 0.47
Multi-task learning GFT (3/5)

Can multi-task learning across the 10 US regions help us improve regional models under sporadic health reporting?

Split US regions into two groups, one including the 2 regions with the highest population (4 and 9 in the map), and the other having the remaining 8 regions

Train and evaluate models for the 8 regions under the hypothesis that there might exist sporadic health reports

Start downsampling the data from the 8 regions using burst error sampling (random data blocks removed) with rate γ (γ = 1: no sampling; γ = 0.1: 10% sample)

[Figure 5: Comparing the performance (MAE) of EN (dotted), GP (dashed), MTEN (dash-dot) and MTGP (solid) on estimating the ILI rates for US HHS Regions (except Regions 4 and 9) for varying burst error sampling (type C) rates (γ)]
Multi-task learning GFT (4/5)
Correlations between US regions induced by the covariance
matrix of the MT GP model
Multi-task learning model seems to be capturing existing
geographical relations
[Figure: ILI estimates for England under varying training data sizes, including MTGP (blue)]

While single-task models appear to infer the trends of the time series correctly, the multi-task estimates are closer to the actual values of the signal's peaks. The results confirm our original hypothesis that data from one country could improve a disease model for another country with similar characteristics. This motivates the development of more advanced transfer learning schemes [41], capable of operating between countries with different languages by overcoming language barrier problems, using variants of machine translation.

[Heatmap: pairwise correlations (0.64–0.96) between the 10 US HHS regions (R1–R10) and the US national model, induced by the covariance matrix of the MT GP model]
Multi-task learning GFT (5/5)

Can multi-task learning across countries (US, England) help us improve the ILI model for England?

5 years of training data:
Elastic Net: r = 0.89, MAE = 0.70
MT Elastic Net: r = 0.90, MAE = 0.49
GP: r = 0.89, MAE = 0.60
MT GP: r = 0.90, MAE = 0.47

1 year of training data:
Elastic Net: r = 0.84, MAE = 1.00
MT Elastic Net: r = 0.86, MAE = 0.60
GP: r = 0.85, MAE = 0.98
MT GP: r = 0.86, MAE = 0.59
Conclusions
Online (user-generated) data can help us improve our
current understanding about public health matters
The original Google Flu Trends was based on a good idea, but on very limited modelling effort, resulting in major errors
Subsequent models improved the statistical modelling as
well as the semantic disambiguation between possible
features and delivered better / more robust performance
Multi-task learning improves disease models further
Future direction: Models without strong supervision
Acknowledgements
Industrial partners
— Microsoft Research (Elad Yom-Tov)
— Google
Public health organisations
— Public Health England
— Royal College of General Practitioners
Funding: EPSRC (“i-sense”)
Collaborators: Andrew Miller, Bin Zou, Ingemar J. Cox
Thank you.
Vasileios Lampos (a.k.a. Bill)
Computer Science
University College London
@lampos
References

[GFT v.1] Ginsberg et al. Detecting influenza epidemics using search engine query data. Nature 457, pp. 1012–1014 (2009).

[GFT v.2] Lampos, Miller, Crossan and Stefansen. Advances in nowcasting influenza-like illness rates using search query logs. Scientific Reports 5, 12760 (2015).

[GFT v.3] Lampos, Zou and Cox. Enhancing feature selection using word embeddings: The case of flu surveillance. WWW ’17, pp. 695–704 (2017).

[MTL] Zou, Lampos and Cox. Multi-task learning improves disease models from Web search. WWW ’18, In Press (2018).
MedicoseAcademics
 
Gorgeous Call Girls Dehradun {8854095900} ❤️VVIP ROCKY Call Girls in Dehradun...
Gorgeous Call Girls Dehradun {8854095900} ❤️VVIP ROCKY Call Girls in Dehradun...Gorgeous Call Girls Dehradun {8854095900} ❤️VVIP ROCKY Call Girls in Dehradun...
Gorgeous Call Girls Dehradun {8854095900} ❤️VVIP ROCKY Call Girls in Dehradun...
Sheetaleventcompany
 
Dehradun Call Girl Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
Dehradun Call Girl Service ❤️🍑 8854095900 👄🫦Independent Escort Service DehradunDehradun Call Girl Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
Dehradun Call Girl Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
Sheetaleventcompany
 
Jaipur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Jaipur No💰...
Jaipur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Jaipur No💰...Jaipur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Jaipur No💰...
Jaipur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Jaipur No💰...
Sheetaleventcompany
 
Nagpur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Nagpur No💰...
Nagpur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Nagpur No💰...Nagpur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Nagpur No💰...
Nagpur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Nagpur No💰...
Sheetaleventcompany
 
Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...
Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...
Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...
Sheetaleventcompany
 
👉 Amritsar Call Girls 👉📞 8725944379 👉📞 Just📲 Call Ruhi Call Girl Near Me Amri...
👉 Amritsar Call Girls 👉📞 8725944379 👉📞 Just📲 Call Ruhi Call Girl Near Me Amri...👉 Amritsar Call Girls 👉📞 8725944379 👉📞 Just📲 Call Ruhi Call Girl Near Me Amri...
👉 Amritsar Call Girls 👉📞 8725944379 👉📞 Just📲 Call Ruhi Call Girl Near Me Amri...
Sheetaleventcompany
 
Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...
Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...
Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...
Sheetaleventcompany
 
Jual Obat Aborsi Di Dubai UAE Wa 0838-4800-7379 Obat Penggugur Kandungan Cytotec
Jual Obat Aborsi Di Dubai UAE Wa 0838-4800-7379 Obat Penggugur Kandungan CytotecJual Obat Aborsi Di Dubai UAE Wa 0838-4800-7379 Obat Penggugur Kandungan Cytotec
Jual Obat Aborsi Di Dubai UAE Wa 0838-4800-7379 Obat Penggugur Kandungan Cytotec
jualobat34
 

Kürzlich hochgeladen (20)

Whitefield { Call Girl in Bangalore ₹7.5k Pick Up & Drop With Cash Payment 63...
Whitefield { Call Girl in Bangalore ₹7.5k Pick Up & Drop With Cash Payment 63...Whitefield { Call Girl in Bangalore ₹7.5k Pick Up & Drop With Cash Payment 63...
Whitefield { Call Girl in Bangalore ₹7.5k Pick Up & Drop With Cash Payment 63...
 
Dehradun Call Girls Service {8854095900} ❤️VVIP ROCKY Call Girl in Dehradun U...
Dehradun Call Girls Service {8854095900} ❤️VVIP ROCKY Call Girl in Dehradun U...Dehradun Call Girls Service {8854095900} ❤️VVIP ROCKY Call Girl in Dehradun U...
Dehradun Call Girls Service {8854095900} ❤️VVIP ROCKY Call Girl in Dehradun U...
 
tongue disease lecture Dr Assadawy legacy
tongue disease lecture Dr Assadawy legacytongue disease lecture Dr Assadawy legacy
tongue disease lecture Dr Assadawy legacy
 
Electrocardiogram (ECG) physiological basis .pdf
Electrocardiogram (ECG) physiological basis .pdfElectrocardiogram (ECG) physiological basis .pdf
Electrocardiogram (ECG) physiological basis .pdf
 
Intramuscular & Intravenous Injection.pptx
Intramuscular & Intravenous Injection.pptxIntramuscular & Intravenous Injection.pptx
Intramuscular & Intravenous Injection.pptx
 
Race Course Road } Book Call Girls in Bangalore | Whatsapp No 6378878445 VIP ...
Race Course Road } Book Call Girls in Bangalore | Whatsapp No 6378878445 VIP ...Race Course Road } Book Call Girls in Bangalore | Whatsapp No 6378878445 VIP ...
Race Course Road } Book Call Girls in Bangalore | Whatsapp No 6378878445 VIP ...
 
❤️Chandigarh Escorts Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ ...
❤️Chandigarh Escorts Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ ...❤️Chandigarh Escorts Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ ...
❤️Chandigarh Escorts Service☎️9814379184☎️ Call Girl service in Chandigarh☎️ ...
 
Gorgeous Call Girls Dehradun {8854095900} ❤️VVIP ROCKY Call Girls in Dehradun...
Gorgeous Call Girls Dehradun {8854095900} ❤️VVIP ROCKY Call Girls in Dehradun...Gorgeous Call Girls Dehradun {8854095900} ❤️VVIP ROCKY Call Girls in Dehradun...
Gorgeous Call Girls Dehradun {8854095900} ❤️VVIP ROCKY Call Girls in Dehradun...
 
Genuine Call Girls Hyderabad 9630942363 Book High Profile Call Girl in Hydera...
Genuine Call Girls Hyderabad 9630942363 Book High Profile Call Girl in Hydera...Genuine Call Girls Hyderabad 9630942363 Book High Profile Call Girl in Hydera...
Genuine Call Girls Hyderabad 9630942363 Book High Profile Call Girl in Hydera...
 
Kolkata Call Girls Shobhabazar 💯Call Us 🔝 8005736733 🔝 💃 Top Class Call Gir...
Kolkata Call Girls Shobhabazar  💯Call Us 🔝 8005736733 🔝 💃  Top Class Call Gir...Kolkata Call Girls Shobhabazar  💯Call Us 🔝 8005736733 🔝 💃  Top Class Call Gir...
Kolkata Call Girls Shobhabazar 💯Call Us 🔝 8005736733 🔝 💃 Top Class Call Gir...
 
Circulatory Shock, types and stages, compensatory mechanisms
Circulatory Shock, types and stages, compensatory mechanismsCirculatory Shock, types and stages, compensatory mechanisms
Circulatory Shock, types and stages, compensatory mechanisms
 
Dehradun Call Girl Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
Dehradun Call Girl Service ❤️🍑 8854095900 👄🫦Independent Escort Service DehradunDehradun Call Girl Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
Dehradun Call Girl Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
 
Jaipur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Jaipur No💰...
Jaipur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Jaipur No💰...Jaipur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Jaipur No💰...
Jaipur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Jaipur No💰...
 
❤️Call Girl Service In Chandigarh☎️9814379184☎️ Call Girl in Chandigarh☎️ Cha...
❤️Call Girl Service In Chandigarh☎️9814379184☎️ Call Girl in Chandigarh☎️ Cha...❤️Call Girl Service In Chandigarh☎️9814379184☎️ Call Girl in Chandigarh☎️ Cha...
❤️Call Girl Service In Chandigarh☎️9814379184☎️ Call Girl in Chandigarh☎️ Cha...
 
Call 8250092165 Patna Call Girls ₹4.5k Cash Payment With Room Delivery
Call 8250092165 Patna Call Girls ₹4.5k Cash Payment With Room DeliveryCall 8250092165 Patna Call Girls ₹4.5k Cash Payment With Room Delivery
Call 8250092165 Patna Call Girls ₹4.5k Cash Payment With Room Delivery
 
Nagpur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Nagpur No💰...
Nagpur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Nagpur No💰...Nagpur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Nagpur No💰...
Nagpur Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Nagpur No💰...
 
Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...
Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...
Goa Call Girl Service 📞9xx000xx09📞Just Call Divya📲 Call Girl In Goa No💰Advanc...
 
👉 Amritsar Call Girls 👉📞 8725944379 👉📞 Just📲 Call Ruhi Call Girl Near Me Amri...
👉 Amritsar Call Girls 👉📞 8725944379 👉📞 Just📲 Call Ruhi Call Girl Near Me Amri...👉 Amritsar Call Girls 👉📞 8725944379 👉📞 Just📲 Call Ruhi Call Girl Near Me Amri...
👉 Amritsar Call Girls 👉📞 8725944379 👉📞 Just📲 Call Ruhi Call Girl Near Me Amri...
 
Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...
Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...
Premium Call Girls Nagpur {9xx000xx09} ❤️VVIP POOJA Call Girls in Nagpur Maha...
 
Jual Obat Aborsi Di Dubai UAE Wa 0838-4800-7379 Obat Penggugur Kandungan Cytotec
Jual Obat Aborsi Di Dubai UAE Wa 0838-4800-7379 Obat Penggugur Kandungan CytotecJual Obat Aborsi Di Dubai UAE Wa 0838-4800-7379 Obat Penggugur Kandungan Cytotec
Jual Obat Aborsi Di Dubai UAE Wa 0838-4800-7379 Obat Penggugur Kandungan Cytotec
 

Mining online data for public health surveillance

  • 14. Google Flu Trends (GFT) [Figures from Nature, Vol 457, 19 February 2009: mean correlation against number of queries included (peaking at 45 queries), and model estimates for the mid-Atlantic region (black) against CDC-reported ILI percentages (red), 2004-2008, with a correlation of 0.85 over the 128 points to which the model was fit]
  • 15. Google Flu Trends (GFT) [same figures, annotated with the question: f ?]
  • 16. GFT — Supervised learning: Regression. Observations (X): frequencies of n search queries for a location L and m contiguous time intervals of length τ. Targets (y): rates of influenza-like illness (ILI) for L and for the same m contiguous time intervals, obtained from a health agency. Learn a function f such that f: X ∈ ℝ^{n×m} ⟶ y ∈ ℝ^m
  • 17. GFT — Supervised learning (continued): frequency of query qi = count of qi / total count of all queries
  • 18. GFT v.1 — Model: logit(P) = β0 + β1 × logit(Q) + ε, where P is the percentage (probability) of ILI-related physician visits, Q is the aggregate frequency of a set of search queries (the ILI-related query fraction), β0 is the regression bias (intercept) term, β1 is the single regression weight, and ε is independent, zero-centered noise. The set of queries is determined by an automated method (see feature selection below).
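The univariate logit-space model above can be sketched numerically. The data here are synthetic and the coefficient values (β0 = −3.0, β1 = 1.2) are made up for illustration, not those of the published GFT model.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit(a):
    return np.log(a / (1.0 - a))

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Q: aggregate frequency of the selected queries; P: ILI physician-visit rate.
# Both are synthetic, generated from the slide's model with invented betas.
Q = rng.uniform(0.01, 0.2, size=128)
P = inv_logit(-3.0 + 1.2 * logit(Q) + rng.normal(0.0, 0.05, size=128))

# Fit (beta0, beta1) by least squares in logit space, as the model prescribes
A = np.column_stack([np.ones_like(Q), logit(Q)])
beta, *_ = np.linalg.lstsq(A, logit(P), rcond=None)
P_hat = inv_logit(A @ beta)   # back-transform predictions to (0, 1)
```

Fitting in logit space and back-transforming keeps the predicted rates inside (0, 1), which a plain linear fit on P would not guarantee.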
  • 22. GFT v.1 — “Logit”, why? Preliminary analysis found that pairwise relationships between query rates and ILI rates are linear in the logit space, motivating the use of this transformation across all variables; related work followed the same modelling principle. Here logit(α) = log(α / (1 − α)) for α ∈ (0, 1).
  • 23. GFT v.1 — “Logit”, why? (continued) Values close to 0.5 are “squashed”, while border values (close to 0 or 1) are “emphasised”; variables are also z-scored.
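The "squashing" claim can be checked directly: an interval of fixed width centred on 0.5 maps to a much narrower logit range than the same-width interval near a border.

```python
import numpy as np

def logit(a):
    # log-odds transform, defined for a in (0, 1)
    return np.log(a / (1.0 - a))

# Two intervals of equal width 0.1 in probability space:
mid_span = logit(0.55) - logit(0.45)      # centred on 0.5 (squashed)
border_span = logit(0.95) - logit(0.85)   # near the border (emphasised)
```

The derivative of logit, 1 / (α(1 − α)), is smallest at α = 0.5 and grows without bound towards 0 and 1, which is exactly the behaviour the slide describes.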
  • 24. GFT v.1 — Data: 9 US regions considered; 50 million search queries (most frequent) geolocated in these 9 US regions; weekly ILI rates from CDC; 170 weeks, 28/9/2003 to 11/5/2008, with ILI rate > 0. First 128 weeks: training, 9 × 128 = 1,152 samples. Last 42 weeks: testing (per region).
  • 25. GFT v.1 — Feature selection (1/2): 1. Single-query flu models are trained for each US region: 50 million queries × 9 US regions = 450 million models. 2. Inference accuracy is estimated for each query using linear correlation (Pearson) as the metric. 3. Starting from the best-performing query, adding one query each time, a new model is trained and evaluated.
  • 26. GFT v.1 — Feature selection (1/2) [Figure 1 | An evaluation of how many top-scoring queries to include in the model: mean correlation peaks at 45 queries. Nature, Vol 457, 19 February 2009]
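The three selection steps above can be mimicked on synthetic data. Two simplifications to note: GFT scored each query by fitting a single-query model per region, whereas this sketch scores queries directly by their Pearson correlation with the ILI series, and it aggregates the selected queries by a plain mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n_weeks, n_queries = 128, 200     # scaled-down stand-ins for 128 weeks / 50M queries
ili = rng.uniform(0.5, 8.0, size=n_weeks)    # synthetic weekly ILI rates

X = rng.normal(size=(n_weeks, n_queries))    # query frequency stand-ins
for j in range(5):                           # 5 queries track ILI, increasingly noisily
    X[:, j] = ili + rng.normal(0.0, j + 1.0, size=n_weeks)

# Steps 1-2: score every query by Pearson correlation with the ILI series
scores = np.array([np.corrcoef(X[:, j], ili)[0, 1] for j in range(n_queries)])
ranking = np.argsort(-scores)                # best-performing query first

# Step 3: grow the aggregate one query at a time, keep the best-scoring size
def aggregate_corr(cols):
    return np.corrcoef(X[:, cols].mean(axis=1), ili)[0, 1]

best_k = max(range(1, 21), key=lambda k: aggregate_corr(ranking[:k]))
```

In GFT this greedy sweep is what produced the 45-query cut-off shown in Figure 1.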
  • 27. GFT v.1 — Feature selection (2/2): Table 1 | Topics found in search queries which were found to be most correlated with CDC ILI data. Top 45 queries, as (n, weighted n): influenza complication (11, 18.15); cold/flu remedy (8, 5.05); general influenza symptoms (5, 2.60); term for influenza (4, 3.74); specific influenza symptom (4, 2.54); symptoms of an influenza complication (4, 2.21); antibiotic medication (3, 6.23); general influenza remedies (2, 0.18); symptoms of a related disease (2, 1.66); antiviral medication (1, 0.39); related disease (1, 6.66); unrelated to influenza (0, 0.00); total (45, 49.40). Of the next 55 queries, shown for comparison purposes, 19 were unrelated to influenza.
  • 28. GFT v.1 — Performance (1/2): Evaluated on 42 weeks (per region) from 2007-2008. Evaluation metric: Pearson correlation; μ(r) = .97 with min(r) = .92 and max(r) = .99. Performance looked great at the time, but this is not a proper performance evaluation! Why? Potentially misleading metric (not the loss function here) and rather small testing time span (< 1 flu season).
  • 29. GFT v.1 — Performance (2/2): Mid-Atlantic US region, Pearson correlation r = .96. [Figure 2 | A comparison of model estimates for the mid-Atlantic region (black) against CDC-reported ILI percentages (red), including points over which the model was fit and validated. A correlation of 0.85 was obtained over 128 points from this region to which the model was fit, whereas a correlation of 0.96 was obtained over 42 validation points. Dotted lines indicate 95% prediction intervals. The region comprises New York, New Jersey and Pennsylvania.]
  • 30. GFT v.2 — Data & evaluation: weekly frequency of 49,708 search queries (US), filtered by a relaxed health-topic classifier; the intersection of frequent queries across all US regions, from 4/1/2004 to 28/12/2013 (521 weeks); corresponding weekly US ILI rates from CDC. Test on 5 flu seasons, i.e. 5 year-long test sets (2008-13); train on increasing data sets starting from 2004, using all data prior to a test period.
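The expanding-window protocol above can be sketched as follows. The week indices (a 261-week initial training period followed by 52-week test seasons) are hypothetical stand-ins chosen to match the 521-week total; the actual season boundaries follow the CDC calendar.

```python
# Expanding-window evaluation: five year-long test sets, each trained on
# all weeks that precede it. Indices are illustrative, not the real dates.
n_weeks = 521            # 4/1/2004 to 28/12/2013
first_test_start = 261   # hypothetical start of the 2008-09 test season
season_len = 52          # hypothetical year-long test set

splits = []
start = first_test_start
while start + season_len <= n_weeks:
    train_idx = list(range(0, start))                  # everything before the test set
    test_idx = list(range(start, start + season_len))  # one full season
    splits.append((train_idx, test_idx))
    start += season_len
```

Each successive split trains on strictly more data, so no test week ever leaks into its own training period.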
  • 31. GFT v.1 was simple to a (significant) fault [Figure: weekly US ILI rates, 2009-2013: CDC ILI rates vs GFT estimates]
  • 32. GFT v.1 was simple to a (significant) fault [same figure] “rsv” — 25%; “flu symptoms” — 18%; “benzonatate” — 6%; “symptoms of pneumonia” — 6%; “upper respiratory infection” — 4%
  • 33. GFT v.2 — Linear multivariate regression (least squares): argmin over w, β of Σᵢ₌₁ⁿ (xᵢw + β − yᵢ)², where X ∈ ℝ^{n×m} holds the frequencies of m search queries for n weeks (xᵢ ∈ ℝ^m for week i, i ∈ {1, …, n}), y ∈ ℝ^n holds the ILI rates from CDC for n weeks (yᵢ ∈ ℝ for week i), w ∈ ℝ^m are the weights for the m search queries, and β ∈ ℝ is the intercept term.
  • 37. ⚠ Least squares regression is not applicable here because we have very few training samples (n) but many features (search queries; m). Models derived from least squares will tend to overfit the data, resulting in bad solutions.
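The overfitting warning can be demonstrated directly: with more queries than weeks (m > n), ordinary least squares interpolates the training data yet generalises poorly. Sizes and noise levels below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 40, 200                       # far fewer weeks than candidate queries
X = rng.normal(size=(n, m))
y = X[:, 0] + rng.normal(0.0, 0.5, size=n)   # only one query truly matters

# Minimum-norm least squares: with m > n it interpolates the training data...
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_resid = float(np.max(np.abs(X @ w - y)))

# ...but generalises badly to fresh weeks drawn from the same process
X_new = rng.normal(size=(100, m))
y_new = X_new[:, 0] + rng.normal(0.0, 0.5, size=100)
test_mse = float(np.mean((X_new @ w - y_new) ** 2))
```

The training residual is essentially zero while the held-out error is far above the 0.25 noise floor, which is exactly the failure mode regularisation is meant to address.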
  • 38. GFT v.2 — Regularisation with elastic net: argmin_{w,β} ( Σ_{i=1}^{n} (x_i^⊤ w + β − y_i)² + λ₁ Σ_{j=1}^{m} |w_j| + λ₂ Σ_{j=1}^{m} w_j² ), i.e. the least squares objective plus L1- and L2-norm regularisers for the weights, with λ₁, λ₂ ∈ ℝ⁺. The elastic net encourages sparse models (feature selection: many weights will be set to zero!), handles collinear features (search queries), and does not limit the number of selected features to the number of samples (n).
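The elastic net objective above can be sketched with scikit-learn. This is an illustrative toy, not the talk's setup: synthetic data stand in for the (private) Google search frequencies, and the `alpha` / `l1_ratio` settings, which jointly encode the λ₁, λ₂ penalties, are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, m = 150, 1000                                # n weeks, m search queries (m >> n)
X = rng.random((n, m))                          # synthetic query frequencies
w_true = np.zeros(m)
w_true[:10] = rng.random(10)                    # only a few queries truly matter
y = X @ w_true + 0.01 * rng.standard_normal(n)  # synthetic ILI rates

# alpha / l1_ratio jointly play the role of the lambda_1, lambda_2 penalties
model = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000).fit(X, y)
selected = np.flatnonzero(model.coef_)          # most weights are zeroed out
```

Even though m ≫ n, the L1 part of the penalty drives most of the 1,000 weights to exactly zero, so `selected` is a small subset of the queries.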
  • 43. GFT v.2 — Feature selection. 1st layer: keep search queries whose frequency time series has a ≥ 0.5 Pearson correlation with the CDC ILI rates (in the training data). 2nd layer: the elastic net assigns weights equal to 0 to features (search queries) identified as statistically irrelevant to our task. Out of 49,708 queries in total, μ (σ) across all training data sets: 937 (334) queries have r ≥ 0.5, GFT selects 46 (39), and the elastic net selects 278 (64).
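The 1st-layer correlation filter can be sketched in a few lines of NumPy; the function name and threshold default are illustrative, not from the talk.

```python
import numpy as np

def correlation_filter(X, y, threshold=0.5):
    """X: (n weeks, m queries); y: (n,) ILI rates. Keep columns with Pearson r >= threshold."""
    Xc = X - X.mean(axis=0)                     # centre each query time series
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.flatnonzero(r >= threshold)       # indices of the kept queries
```

Note the threshold is one-sided, as on the slide: strongly anti-correlated queries are discarded too.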
  • 44. GFT v.2 — Evaluation (1/2). Target variable: CDC (ILINet) ILI rates y = y₁, …, y_N, published weekly and indicating the percentage of ILI prevalence at a national level; these are used to train and evaluate the models throughout this work. Estimates: ŷ = ŷ₁, …, ŷ_N. Performance metrics: Pearson correlation, r = (1/(N−1)) Σ_{t=1}^{N} ((y_t − μ(y))/σ(y)) · ((ŷ_t − μ(ŷ))/σ(ŷ)), where μ(·) and σ(·) are the sample mean and standard deviation respectively; Mean Absolute Error, MAE(ŷ, y) = (1/N) Σ_{t=1}^{N} |ŷ_t − y_t|; Mean Absolute Percentage of Error, MAPE(ŷ, y) = (1/N) Σ_{t=1}^{N} |ŷ_t − y_t| / y_t; Mean Squared Error, MSE(ŷ, y) = (1/N) Σ_{t=1}^{N} (ŷ_t − y_t)². Note that performance is measured after reversing the logit transformation.
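The metrics above translate directly from their definitions; this small sketch (illustrative function names) implements them as on the slide, with the sample standard deviation (ddof=1) matching the 1/(N−1) normalisation of r.

```python
import numpy as np

def pearson_r(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    zy = (y - y.mean()) / y.std(ddof=1)         # standardised ground truth
    zh = (y_hat - y_hat.mean()) / y_hat.std(ddof=1)
    return float((zy * zh).sum() / (len(y) - 1))

def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y_hat, float) - np.asarray(y, float))))

def mape(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y_hat - y) / y))
```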
  • 45. GFT v.2 — Evaluation (2/2). [Figure: weekly ILI rate (US), 2009–2013, comparing CDC ILI rates with the Elastic Net and GFT estimates.] GFT: r = .89, MAE = 3.81·10⁻³, MAPE = 20.4%. Elastic net: r = .92, MAE = 2.60·10⁻³, MAPE = 11.9%.
  • 47. GFT v.2 — Nonlinearities in the data. [Scatter plots with linear fits: US ILI rates (CDC) against the frequency of the queries 'flu', 'flu medicine', 'how long is flu contagious', 'how to break a fever' and 'sore throat treatment'; in each case the relationship between query frequency and ILI rate is visibly nonlinear.]
  • 52. GFT v.2 — Gaussian Processes (1/4). A Gaussian Process (GP) learns a distribution over functions that can explain the data: a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution. It is fully specified by a mean function m and a covariance (kernel) function k; we set m(x) = 0 in our experiments: f(x) ~ GP(m(x), k(x, x′)), with x, x′ ∈ ℝᵐ and f: ℝᵐ → ℝ. For inference, f* ~ N(0, K), where (K)_ij = k(x_i, x_j).
  • 55. GFT v.2 — Gaussian Processes (2/4). Common GP kernels (covariance functions): squared-exponential (SE), k(x, x′) = σ_f² exp(−(x − x′)²/(2ℓ²)), capturing local variation; periodic (Per), k(x, x′) = σ_f² exp(−(2/ℓ²) sin²(π(x − x′)/p)), capturing repeating structure; linear (Lin), k(x, x′) = σ_f² (x − c)(x′ − c), yielding linear functions.
  • 56. GFT v.2 — Gaussian Processes (3/4). Adding or multiplying GP kernels produces a new valid GP kernel: Lin × Lin gives quadratic functions, SE × Per locally periodic structure, Lin × SE increasing variation, and Lin × Per growing amplitude.
  • 57. GFT v.2 — Gaussian Processes (4/4). [Figure sequence: (x, y) pairs with an obvious nonlinear relationship; an ordinary least squares (OLS) fit, which is a poor solution; and a fit using the sum of two GP kernels (periodic + squared exponential), which follows the data closely.]
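The fit on the slide can be approximated with scikit-learn's GP tools; this is a sketch on synthetic data (the kernel choices mirror the slide's periodic + squared-exponential sum, but the hyperparameter initialisations are illustrative).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

rng = np.random.default_rng(1)
x = np.linspace(0, 60, 120)[:, None]
y = 0.15 * x.ravel() + 3.0 * np.sin(x.ravel()) + 0.3 * rng.standard_normal(120)

# periodic + squared-exponential (plus noise), summed into one valid GP kernel
kernel = (ExpSineSquared(length_scale=1.0, periodicity=6.3)
          + RBF(length_scale=10.0)
          + WhiteKernel(noise_level=0.1))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x, y)
y_hat = gp.predict(x)
```

Unlike the OLS line, the GP posterior mean can track both the slow trend (SE part) and the oscillation (periodic part) at once.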
  • 60. GFT v.2 — k-means and GP regression. Cluster the queries selected by the elastic net into C clusters with k-means; clusters are determined using cosine similarity, cos(v, u) = v · u / (‖v‖‖u‖), as the distance metric on the query frequency time series, which groups queries with similar topicality and usage patterns. The GP kernel sums one squared-exponential kernel per cluster plus a noise term: k(x, x′) = (Σ_{i=1}^{C} k_SE(x_{c_i}, x′_{c_i})) + σ² · δ(x, x′), where x = {x_{c₁}, …, x_{c₁₀}} are the per-cluster inputs, δ is a Kronecker delta, and k_SE(x, x′) = σ² exp(−‖x − x′‖²/(2ℓ²)).
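A minimal sketch of the clustering step, under one common trick: L2-normalising each query's time series makes Euclidean k-means behave like clustering by cosine similarity (for unit vectors, ‖a − b‖² = 2 − 2·cos(a, b)). Data and cluster count are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(3)
n_weeks, n_queries, C = 104, 60, 10      # C clusters, as on the slide
F = rng.random((n_queries, n_weeks))     # one row per query frequency time series

F_unit = normalize(F)                    # rows scaled to unit L2 norm
labels = KMeans(n_clusters=C, n_init=10, random_state=0).fit_predict(F_unit)
```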
  • 61. GFT v.2 — Performance. [Figure: weekly ILI rate (US), 2009–2013, comparing CDC ILI rates with the GP and Elastic Net estimates.] Elastic net: r = .92, MAE = 2.60·10⁻³, MAPE = 11.9%. GP: r = .95, MAE = 2.21·10⁻³, MAPE = 10.8%.
  • 63. GFT v.2 — Queries' added value. Autoregression: combine CDC ILI rates from the previous week(s) with the ILI rate estimate from search queries for the current week; various week lags are explored (1, 2, …, 6 weeks). The full ARMAX model, with the season fixed to 52 weeks (1 year long), is y_t = Σ_{i=1}^{p} φ_i y_{t−i} + Σ_{i=1}^{J} ω_i y_{t−52i} (AR and seasonal AR) + Σ_{i=1}^{q} θ_i ε_{t−i} + Σ_{i=1}^{K} ν_i ε_{t−52i} (MA and seasonal MA) + Σ_{i=1}^{D} w_i h_{t,i} (regression on the search-based estimates) + ε_t, where ω_i and ν_i are seasonal lagged-variable parameters of orders J and K respectively; among a series of candidate models, the one minimising the Akaike Information Criterion (AIC) is selected.
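Stripped of the seasonal and MA terms, the core idea reduces to a small regression: this week's ILI rate explained by last week's CDC rate plus the current search-based estimate. A hedged sketch on synthetic series (the real model uses the full ARMAX form above):

```python
import numpy as np

rng = np.random.default_rng(4)
weeks = 200
ili = 3 + 2 * np.sin(np.arange(weeks) * 2 * np.pi / 52) + 0.1 * rng.standard_normal(weeks)
search_est = ili + 0.3 * rng.standard_normal(weeks)   # noisy query-based estimate

lag = 1                                               # 1-week CDC reporting lag
A = np.column_stack([ili[:-lag], search_est[lag:], np.ones(weeks - lag)])
target = ili[lag:]
coef, *_ = np.linalg.lstsq(A, target, rcond=None)     # fit AR(CDC, search) weights
pred = A @ coef
```

Because the search estimate is one of the regressors, the fitted combination can never do worse (in squared error, in-sample) than the search estimate alone, which is the intuition behind the AR(CDC, GP) gains on the next slide.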
  • 64. GFT v.2 — Performance. GP: r = .95, MAE = 2.21·10⁻³, MAPE = 10.8%. With a 1-week lag for the CDC data: AR(CDC) r = .97, MAE = 1.87·10⁻³, MAPE = 8.2%; AR(CDC, GP) r = .99, MAE = 1.05·10⁻³, MAPE = 5.7%. With a 2-week lag for the CDC data: AR(CDC) r = .87, MAE = 3.36·10⁻³, MAPE = 14.3%; AR(CDC, GP) r = .99, MAE = 1.35·10⁻³, MAPE = 7.3%.
  • 66. GFT v.2 — Non-optimal feature selection. Queries irrelevant to flu are still retained, e.g. "nba injury report" or "muscle building supplements". Feature selection is primarily based on correlation, and then on a linear relationship. Introduce a semantic feature selection to:
 — enhance causal connections (implicitly)
 — circumvent the painful training of a classifier
  • 67. GFT v.3 — Word embeddings Word embeddings are vectors of a certain dimensionality (usually from 50 to 1024) that represent words in a corpus Derive these vectors by predicting contextual word occurrence in large corpora (word2vec) using a shallow neural network approach:
 — Continuous Bag-Of-Words (CBOW): Predict centre word from surrounding ones
 — skip-gram: Predict surrounding words from centre one Other methods available: GloVe, fastText
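The two training schemes can be made concrete by writing out the training examples they consume. A pure-Python sketch (function names are illustrative; this generates the examples, it does not train word2vec):

```python
def skipgram_pairs(tokens, window=2):
    """skip-gram: predict each surrounding word from the centre word."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: predict the centre word from all of its surrounding words."""
    examples = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, centre))
    return examples
```

For `["i", "have", "the", "flu"]` with a window of 1, skip-gram produces pairs like `("have", "i")` and `("have", "the")`, while CBOW produces `(["i", "the"], "have")`.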
  • 68. GFT v.3 — Word embedding data sets Use tweets geolocated in the UK to learn word embeddings that may capture
 — informal language used in searches
 — British English language / expressions
  — cultural biases. (a) 215 million tweets (February 2014 to March 2016), CBOW, 512 dimensions, 137,421 words covered:
  https://doi.org/10.6084/m9.figshare.4052331.v1 (b) 1.1 billion tweets (2012 to 2016), skip-gram, 512 dimensions, 470,194 words covered:
 https://doi.org/10.6084/m9.figshare.5791650.v1
  • 69. GFT v.3 — Cosine similarity on word embeddings: cos(v, u) = v · u / (‖v‖‖u‖) = Σ_{i=1}^{n} v_i u_i / (√(Σ_{i=1}^{n} v_i²) · √(Σ_{i=1}^{n} u_i²)). Analogies combine a positive and a negative context: max_v (cos(v, 'king') + cos(v, 'woman') − cos(v, 'man')) ⇒ v = 'queen', or, multiplicatively, max_v (ĉos(v, 'king') × ĉos(v, 'woman') / ĉos(v, 'man')) ⇒ v = 'queen', where ĉos(·, ·) = (cos(·, ·) + 1)/2.
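As a toy illustration of the additive analogy rule above; the 2-d vectors here are hypothetical stand-ins for the 512-d Twitter embeddings.

```python
import numpy as np

def cos(v, u):
    return float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))

def analogy(emb, pos, neg):
    """Word maximising the summed cosines to `pos` minus those to `neg`."""
    scores = {}
    for w, v in emb.items():
        if w in pos or w in neg:
            continue                         # exclude the query words themselves
        scores[w] = (sum(cos(v, emb[p]) for p in pos)
                     - sum(cos(v, emb[n]) for n in neg))
    return max(scores, key=scores.get)

# hypothetical 2-d embeddings: axis 0 ~ royalty, axis 1 ~ gender
emb = {
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.1, 1.0]),
    'woman': np.array([0.1, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}
result = analogy(emb, pos=['king', 'woman'], neg=['man'])  # → 'queen'
```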
  • 73. GFT v.3 — Analogies in Twitter embeddings ("the … for …, not the …, is …"): (woman, king, man) → queen; (him, she, he) → her; (better, bad, good) → worse; (England, Rome, London) → Italy; (Messi, basketball, football) → Lebron; (Guardian, Conservatives, Labour) → Telegraph; (Trump, Europe, USA) → Farage; (rsv, fever, skin) → flu.
  • 77. GFT v.3 — Better query selection (1/3). 1. Query embedding = average token embedding. 2. Derive a concept by specifying a positive (P) and a negative (N) context (sets of n-grams); example context n-grams include "tummy ache, nausea, feeling nausea, nausea and vomiting, bloated tummy, dull stomach ache, heartburn, feeling bloated, stomach ache, feeling nauseous, headache and nausea" and "flu epidemic, flu, dispensary, hospital, fever, flu outbreak, epidemic, flu medicine, flu treatment, influenza flu, flu pandemic, gp surgery, clinic, flu vaccine, flu shot, physician, flu vaccination". 3. Rank all queries using their similarity score with this concept: S(Q, C) = Σ_{i=1}^{k} cos(e_Q, e_{P_i}) / (Σ_{j=1}^{z} cos(e_Q, e_{N_j}) + γ), i.e. a ratio of sums of cosine similarities between the query embedding e_Q and the embeddings of the positive and negative context n-grams, with the constant γ avoiding division by 0.
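A hedged sketch of the score S(Q, C), implemented directly from the formula above on toy 2-d embeddings (function names and the γ default are illustrative; the real paper's exact normalisation may differ).

```python
import numpy as np

def cos(v, u):
    return float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))

def query_embedding(tokens, emb):
    """Query embedding = average of its token embeddings."""
    return np.mean([emb[t] for t in tokens], axis=0)

def concept_score(e_q, positives, negatives, gamma=0.001):
    num = sum(cos(e_q, p) for p in positives)            # pull towards P
    den = sum(cos(e_q, n) for n in negatives) + gamma    # push away from N
    return num / den

# toy embeddings: axis 0 ~ illness, axis 1 ~ celebrity/noise
emb = {
    'flu':    np.array([1.0, 0.0]),
    'fever':  np.array([0.9, 0.1]),
    'cold':   np.array([0.8, 0.2]),
    'bieber': np.array([0.0, 1.0]),
}
```

A query like "cold flu" scores far higher against a flu concept (P = {flu, fever}, N = {bieber}) than a celebrity query does, which is exactly what drives the ranking on the next slide.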
  • 80. GFT v.3 — Better query selection (2/3). Example concepts and their most similar queries: positive context {#flu, fever, flu, flu medicine, gp, hospital} with negative context {bieber, ebola, wikipedia} yields cold flu medicine, flu aches, cold and flu, cold flu symptoms, colds and flu, flu; positive context {flu gp, flu hospital, flu medicine} with negative context {ebola, wikipedia} yields flu aches, flu, colds and flu, cold and flu, cold flu medicine.
  • 81. GFT v.3 — Better query selection (3/3). [Figure: histogram of the search-query similarity scores S (Eq. 7) with the flu infection concept C₁.] Given that the distribution of concept similarity scores appears to be unimodal, we use standard deviations from the mean (μ_S + θσ_S) as thresholds to determine the selected queries; raising the threshold shrinks the selected set from tens of thousands of queries (no filtering) down to just 7 at S > μ_S + 3.5σ_S.
  • 82. GFT v.3 — Hybrid feature selection. Embedding-based feature selection is an unsupervised technique, thus not optimal on its own. If we combine it with the previous ways of selecting features, will we obtain better inference accuracy? We test 7 feature selection approaches: similarity → elastic net (1); correlation → elastic net (2) → GP (3); similarity → correlation → elastic net (4) → GP (5); similarity → correlation → GP (6); correlation → GP (7)
  • 83. GFT v.3 — GP model details. Skipped in the interest of time! In brief: with μ(x) = 0, the kernel sums two Matérn (ν = 3/2) covariances, handling long- and medium/short-term irregularities, a squared-exponential covariance kSE(x, x′) = σ² exp(−r²/(2ℓ²)) for smoother trends, and white noise: k(x, x′) = Σ_{i=1}^{2} k_M^{(ν=3/2)}(x, x′; σ_i, ℓ_i) + kSE(x, x′; σ₃, ℓ₃) + σ₄² δ(x, x′), where r = ‖x − x′‖ and δ is a Kronecker delta function; the summation of GP kernels is itself a valid GP kernel, and the 7 parameters are optimised using the Laplace approximation. If you're interested, check Section 3.1 of https://doi.org/10.1145/3038912.3052622
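The composite kernel can be assembled with scikit-learn's kernel classes; length scales and variances here are illustrative placeholders (the real values are learnt from the data).

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel

kernel = (Matern(length_scale=2.0, nu=1.5)     # long-term irregularities
          + Matern(length_scale=0.5, nu=1.5)   # medium/short-term irregularities
          + RBF(length_scale=10.0)             # smooth trends (squared exponential)
          + WhiteKernel(noise_level=0.1))      # observation noise

X = np.linspace(0, 5, 20)[:, None]
K = kernel(X)                                  # Gram matrix (K)_ij = k(x_i, x_j)
```

The sum is itself a valid kernel, so the resulting Gram matrix is symmetric and positive semi-definite.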
  • 84. GFT v.3 — Data & evaluation weekly frequency of 35,572 search queries (UK) from 1/1/2007 to 9/08/2015 (449 weeks) access to a private Google Health Trends API for health- oriented research corresponding ILI rates for England (Royal College of General Practitioners and Public Health England) test on the last 3 flu seasons in the data (2012-2015) train on increasing data sets starting from 2007, using all data prior to a test period
  • 85. GFT v.3 — Performance (1/3). (a) similarity → elastic net: r = .87, MAE (×0.1) = 0.30, MAPE = 61.05%; (b) correlation → elastic net: r = .88, MAE (×0.1) = 0.21, MAPE = 47.15%; (c) similarity → correlation → elastic net: r = .91, MAE (×0.1) = 0.19, MAPE = 36.23%.
  • 87. GFT v.3 — Performance (2/3). [Figure: ILI rate per 100,000 people, 2013–2015, comparing the optimal correlation-based and hybrid feature selection models under Elastic Net against the RCGP/PHE ILI rates for England (ground truth).] The hybrid feature selection filters potentially misleading queries that previously received a nonzero regression weight: queries irrelevant to the target theme, referring to specific individuals and related activities, different health problems or seasonal topics. Examples, with their weight as a ratio over the highest weight: "prof. surname" (70.3%), "name surname" (27.2%), "heal the world" (21.9%), "heating oil" (21.2%), "name surname recipes" (21%), "tlc diet" (13.3%), "blood game" (12.3%), "swine flu vaccine side effects" (7.2%).
  • 89. GFT v.3 — Performance (3/3). (a) correlation → GP: r = .89, MAE (×0.1) = 0.22, MAPE = 34.17%; (b) correlation → elastic net → GP: r = .92, MAE (×0.1) = 0.23, MAPE = 35.88%; (c) similarity → correlation → elastic net → GP: r = .93, MAE (×0.1) = 0.17, MAPE = 30.30%; (d) similarity → correlation → GP: r = .94, MAE (×0.1) = 0.16, MAPE = 25.81%.
  • 91. Multi-task learning m tasks (problems) t1,…,tm observations Xt1,yt1,…,Xtm,ytm learn models fti: Xti → yti jointly (and not independently) Why? When tasks are related, multi-task learning is expected to perform better than learning each task independently Model learning possible even with a few training samples
  • 92. Multi-task learning for disease modelling m tasks (problems) t1,…,tm observations Xt1,yt1,…,Xtm,ytm learn models fti: Xti → yti jointly (and not independently) Can we improve disease models (flu) from online search: when sporadic training data are available? across the geographical regions of a country? across two different countries?
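One concrete form of joint learning is shown below. This is an illustrative sketch, not the talk's exact model: it uses scikit-learn's `MultiTaskElasticNet` (the "MT Elastic Net" baseline in the results) on synthetic regional data, whose L2,1 penalty forces the same search queries to be selected for every task.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

rng = np.random.default_rng(2)
n, m, tasks = 100, 300, 10             # weeks, queries, regions (tasks)
X = rng.random((n, m))                 # shared query frequencies
W = np.zeros((m, tasks))
W[:8] = rng.random((8, tasks))         # the same few queries matter in every region
Y = X @ W + 0.01 * rng.standard_normal((n, tasks))

mt = MultiTaskElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000).fit(X, Y)
# each row of coef_ is a task; the L2,1 penalty keeps or drops a query jointly
shared_support = np.flatnonzero(np.abs(mt.coef_).sum(axis=0))
```

Because the tasks share their support, a region with little training data can still borrow the feature selection learnt from the others, which is the motivation behind the experiments that follow.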
  • 93. Multi-task learning GFT (1/5). [Figure: the 10 US regions as specified by the Department of Health & Human Services (HHS).] Can multi-task learning across the 10 US regions help us improve the national ILI model?
  • 94. Multi-task learning GFT (1/5) Can multi-task learning across the 10 US regions help us improve the national ILI model? [bar chart — Elastic Net, MT Elastic Net, GP, MT GP; 5 years of training data: MAE 0.25–0.35, r 0.96–0.97]
  • 95. Multi-task learning GFT (1/5) Can multi-task learning across the 10 US regions help us improve the national ILI model? [bar chart — Elastic Net, MT Elastic Net, GP, MT GP; 1 year of training data: MAE 0.44–0.53, r 0.85–0.88]
  • 96. Multi-task learning GFT (2/5) Can multi-task learning across the 10 US regions help us improve the regional ILI models? [bar chart — Elastic Net, MT Elastic Net, GP, MT GP; 1 year of training data: MAE 0.47–0.54, r 0.84–0.87]
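One standard way to realise a multi-task GP (not necessarily the exact formulation of the paper behind these slides) is the intrinsic coregionalization model, where the joint kernel is the Kronecker product of a task covariance matrix B and an input kernel. A self-contained numpy sketch on synthetic data, with B fixed by hand rather than learned:

```python
# Multi-task GP regression via an intrinsic coregionalization model (ICM):
# K((x, i), (x', j)) = B[i, j] * k_rbf(x, x'), assembled with np.kron.
import numpy as np

def rbf(A, C, ell=1.0):
    """RBF kernel matrix between row-wise input sets A and C."""
    d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

rng = np.random.default_rng(2)
n, m = 60, 2                                     # points per task, tasks
X = np.sort(rng.uniform(0, 10, n))[:, None]
f = np.sin(X[:, 0])                              # shared latent signal
Y = np.stack([f + rng.normal(0, 0.1, n),         # task 1
              0.8 * f + rng.normal(0, 0.1, n)])  # task 2, correlated with task 1

B = np.array([[1.0, 0.8],                        # task covariance: fixed here,
              [0.8, 1.0]])                       # normally learned from data

Kx = rbf(X, X)
K = np.kron(B, Kx) + 1e-2 * np.eye(n * m)        # jitter ~ observation noise
mean = np.kron(B, Kx) @ np.linalg.solve(K, Y.ravel())
mean = mean.reshape(m, n)                        # posterior mean, per task
print(np.abs(mean[0] - f).mean())                # small: close to sin(x)
```

Because B couples the tasks, each task's posterior mean borrows strength from the other task's observations, which is the mechanism the regional/national experiments on these slides rely on.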
  • 97. Multi-task learning GFT (3/5) Can multi-task learning across the 10 US regions help us improve regional models under sporadic health reporting? Split the US regions into two groups, one including the 2 regions with the highest population (4 and 9 in the map), and the other having the remaining 8 regions. Train and evaluate models for the 8 regions under the hypothesis that health reports may be sporadic. Downsample the data from the 8 regions using burst error sampling (random contiguous data blocks removed) with rate γ (γ = 1: no sampling; γ = 0.1: a 10% sample).
  • 98. Multi-task learning GFT (3/5) Can multi-task learning across the 10 US regions help us improve regional models under sporadic health reporting? Split the US regions into two groups, one including the 2 regions with the highest population (4 and 9 in the map), and the other having the remaining 8 regions. Train and evaluate models for the 8 regions under the hypothesis that health reports may be sporadic. Downsample the data from the 8 regions using burst error sampling (random contiguous data blocks removed) with rate γ (γ = 1: no sampling; γ = 0.1: a 10% sample). [line chart from the TheWebConf 2018 paper, Figure 5: MAE of EN (dotted), GP (dashed), MTEN (dash-dot) and MTGP (solid) on estimating ILI rates for the US HHS Regions (except Regions 4 and 9) for varying burst error sampling rates γ]
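The burst error sampling scheme can be sketched as follows; the block length and the exact removal rule are assumptions for illustration, not the paper's precise protocol:

```python
# Burst error downsampling: remove random contiguous blocks of weekly health
# reports until only ~gamma of them remains (block length is an assumption).
import numpy as np

def burst_error_sample(n_weeks, gamma, block=4, seed=0):
    """Boolean mask keeping roughly gamma * n_weeks weeks, dropped in bursts."""
    rng = np.random.default_rng(seed)
    keep = np.ones(n_weeks, dtype=bool)
    target = int(round(gamma * n_weeks))
    while keep.sum() > target:
        start = rng.integers(0, n_weeks - block + 1)
        keep[start:start + block] = False   # drop a contiguous block
    return keep

mask = burst_error_sample(260, gamma=0.5)   # ~5 years of weekly data, halved
print(mask.mean())
```

Unlike i.i.d. dropout of single weeks, burst removal mimics realistic reporting gaps (e.g. a surveillance system going offline for a month), which is the regime the figure on this slide evaluates.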
  • 99. Multi-task learning GFT (4/5) Correlations between US regions induced by the covariance matrix of the MT GP model. The multi-task learning model seems to be capturing existing geographical relations. [heatmap: pairwise correlations for Regions R1–R10 and the US national model, colour scale 0.64–0.96]
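Turning a learned task covariance matrix into the correlations visualised on this slide is a simple normalisation; the matrix B below is illustrative, not the fitted one from the paper:

```python
# Convert a (hypothetical) MTGP task covariance B into a correlation matrix,
# as plotted in the inter-region heatmap: C_ij = B_ij / sqrt(B_ii * B_jj).
import numpy as np

B = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.2, 0.8],
              [0.7, 0.8, 0.9]])      # task (region) covariance, illustrative

d = np.sqrt(np.diag(B))
C = B / np.outer(d, d)               # unit diagonal, entries in [-1, 1]
print(np.round(C, 2))
```

High off-diagonal entries then indicate region pairs whose ILI signals the model treats as strongly related, which is how geographical structure shows up in the heatmap.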
  • 100. Multi-task learning GFT (5/5) Can multi-task learning across countries (US, England) help us improve the ILI model for England? [bar chart — Elastic Net, MT Elastic Net, GP, MT GP; 5 years of training data: MAE 0.47–0.70, r 0.89–0.90]
  • 101. Multi-task learning GFT (5/5) Can multi-task learning across countries (US, England) help us improve the ILI model for England? [bar chart — Elastic Net, MT Elastic Net, GP, MT GP; 1 year of training data: MAE 0.59–1.00, r 0.84–0.86]
  • 102. Conclusions Online (user-generated) data can help us improve our current understanding of public health matters. The original Google Flu Trends was based on a good idea but very limited modelling effort, resulting in major errors. Subsequent models improved the statistical modelling as well as the semantic disambiguation of candidate features, and delivered better, more robust performance. Multi-task learning improves disease models further. Future direction: models without strong supervision.
  • 103. Acknowledgements Industrial partners — Microsoft Research (Elad Yom-Tov) — Google Public health organisations — Public Health England — Royal College of General Practitioners Funding: EPSRC (“i-sense”) Collaborators: Andrew Miller, Bin Zou, Ingemar J. Cox
  • 104. Thank you. Vasileios Lampos (a.k.a. Bill) Computer Science University College London @lampos
  • 105. References Ginsberg et al. Detecting influenza epidemics using search engine query data. Nature 457, pp. 1012–1014 (2009). [GFT v.1] Lampos, Miller, Crossan and Stefansen. Advances in nowcasting influenza-like illness rates using search query logs. Scientific Reports 5, 12760 (2015). [GFT v.2] Lampos, Zou and Cox. Enhancing feature selection using word embeddings: The case of flu surveillance. WWW '17, pp. 695–704 (2017). [GFT v.3] Zou, Lampos and Cox. Multi-task learning improves disease models from Web search. WWW '18, in press (2018). [MTL]