Simulmedia ARF Presentation - Early Lessons Learned In Applying Big Data To Television Advertising
1. Early Lessons Learned in Applying Big Data To TV Advertising
ARF September 12, 2011
Jack Smith, Chief Product Officer, Simulmedia
2. About Us
Who We Are
We are a New York based start-up. We are venture backed by Avalon Ventures, Union Square Ventures and Time-Warner.
Where We Have Been
Our 35 person team has veterans of:
What We Believe
Television is still the most powerful advertising medium in the world. While addressability will come, we’re not waiting for it. We’ve taken a few strategies we learned from the Internet and are applying them to linear TV advertising, today.
How We Do It
Through partnerships with major data providers, we have assembled the world’s largest set of actionable television data.
How We Make Money
We sell television advertising. With inventory in over 106 million US households, we can cost-effectively extend reach into high-value target audiences across virtually any advertiser category. We use big data and science to do this.
3. Why Did We Leave The Web?
Television remains the dominant consumer medium
(a) Nielsen US TV Viewing Audience. Traditional Live-Only TV based on average monthly viewing during 1Q2011. Internet and Online Video based on average monthly consumption during July 2011. Video on Demand based on consumption during May 2011.
6. Campaign Reach Is Declining
Impossible for measurement and planning tools to keep pace
Source: Simulmedia analysis of data from SQAD, Nielsen and TVB
8. Big Data Is Driving Growth
“We are on the cusp of a tremendous wave of innovation, productivity and growth, as well as new modes of competition and value-capture – all driven by Big Data.”
- McKinsey Global Institute, May 2011
“For CMOs, Big Data is a very big deal.”
- Alfredo Gangotena, CMO, Mastercard, July 2011
16. But Big Data Is More Than Size
What happened? / Why did it happen? / What’s going to happen next? (BIG DATA)
Time: Past / Future
Focus: Reporting / Prediction
Supports: Human decisions / Machine decisions
Data: Structured, Aggregated / Unstructured, Unaggregated
Human Skills: Dashboards, Excel / Discovery, Visualization / Statistics & Physics
17. Accelerating The Push To Big Data
Hadoop, cloud computing, Facebook, Yahoo, quants, BitTorrent, machine learning, Stanford, Large Hadron Collider, Wal-Mart, text processing, Amazon S3 & EC2, open source intelligence, NoSQL, social media, Google, commodity hardware, Hive, fraud detection, trading desks, MapReduce, natural language processing
18. What Can It Mean For TV Advertising?
Big data drove the rise of web & search advertising
• Accumulation of high volume of direct measurement of media consumption
• Better predictions about consumer interests
• Real time return path
• Automation
• Interim step for addressability
• More diligence around consumer privacy
• Media buyers and sellers rethinking their approach to audience packaging, campaign planning, technology, data assembly and people
19. Post Modern Architecture
Have we reached the limits of classic data storage architecture?
Data Warehouses
• Yahoo!: 700 TB (1)
• Australian Bureau of Statistics: 250 TB (1)
• AT&T: 250 TB (1)
• Nielsen: 45 TB (1)
• Adidas: 13 TB (1)
Data Lakes
• Facebook: 30 PB (3) (7x compression)
• Yahoo: 22 PB (4)
• Google: ???
• Wal-Mart: 1 PB (2)
(1) Oracle F1Q10 Earnings Call September 16, 2009 Transcript
(2) Stair, Principles of Information Systems, 2009, p 181
(3) Dhruba Borthakur, Facebook, December 2010, http://www.facebook.com/note.php?note_id=468211193919
(4) Simulmedia estimate
20. Our Idea of Big Data
Bringing the data set together in a single platform
Nielsen Ratings
• All Minute Respondent Level Data (AMRLD)
Set Top Boxes
• 17+ million boxes
• Completely anonymous viewing
• Live
• DVR
• VOD
• Pay channels
Program
• 3 different sets of schedule data
• Proprietary metadata
Public
• US census
• Military
• Business
Ad Occurrence
• What ads ran?
• Where did they run?
Client Proprietary
• Business Development Indices (BDI)
• Commercial Development Indices (CDI)
• Regional sales data
Our (comparatively modest) data set:
• 200 TB (approx. 7x compression)
• 113,858,592 daily events
• Approximately 402,301 weekly ads
• Double capacity every 6 months
…And we don’t load every data point across all data sets, yet
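As a rough illustration of what "double capacity every 6 months" implies, the sketch below projects storage forward from the 200 TB figure on this slide. The function name and the assumption of smooth exponential growth are ours for illustration; the slide states only the doubling period.

```python
def projected_capacity_tb(start_tb: float, months: float,
                          doubling_months: float = 6.0) -> float:
    """Capacity after `months`, assuming it doubles every `doubling_months`.

    Assumption: growth is a smooth exponential between doublings.
    """
    return start_tb * 2 ** (months / doubling_months)


# Project the slide's 200 TB figure forward over two years.
for months in (0, 6, 12, 24):
    print(f"{months:>2} months: {projected_capacity_tb(200, months):,.0f} TB")
```

Under these assumptions, two years of doubling every six months is a 16x increase, which is why the hardware and storage choices matter so much.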
21. Rethinking Media Data Architecture
Applying big data to television required us to rethink what our technical architecture should be
Commodity Hardware
• No clouds allowed (ISO compliance)
• Expect hardware failure
Open Source Software
• Learn from those who have done it
• Participate in the Open Source community
Write Your Own Software
• ELT (Extract, Load, Transform)
• Meddle
Science
• Machine learning
• Advanced statistical techniques
• Experimentation
23. The People We Needed
A different approach required different skill sets
• New core skills for everyone in the company
• Pattern recognition
• Visualization
• Technology
• Experimentation
• Where do you find hard to find tech skills?
• You don’t find them. You make them.
• A dedicated Science team
• Non traditional researchers (Brain imaging, bioinformatics, economic modeling, genetics)
• People who watch a lot of television
25. Some Things To Know, First
• Live viewing unless otherwise noted
• Time shifting lessons is a whole other presentation
• Time shifting + live viewing lessons is a whole other other presentation
• Video on demand is a whole other other other presentation
• We name names and provide numbers where clients and data partners permit
• Client confidentiality is important to us
• None of this work would’ve been possible without the help of our clients and partners
Read me… This box will contain important information about the graphs on each page.
26. 60% of TV Viewers Watch 90% of TV
Highly Confidential
27. Where The Other 40% Are
[Chart. Vertical: Ratio of Heavy Viewers to light viewer impressions. Horizontal: Low rated to Highly rated networks.]
Networks with relatively fewer lighter viewer impressions: TCM 13.6, HALLMARK 13.7, ADSWIM 14.0, NICKNITE 14.3, CNBC 15.7, FOX NEWS 18.0
Networks with relatively more lighter viewer impressions: OXYGEN 7.4, WE 7.6, PLANET GREEN 7.7, OVATION 7.8, STYLE 7.8, MTV2 7.8, SUNDANCE 7.9, IFC 7.9
Call outs: Ratio is the number of Heavier Viewer impressions you would deliver to reach a Lighter Viewer on a given network
Sources: Nielsen & Simulmedia’s a7
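The call-out ratio on this slide can be sketched as a simple division of heavier-viewer impressions by lighter-viewer impressions per network. The network names and impression counts below are hypothetical placeholders, not figures from the deck, and the function name is ours.

```python
def heavy_to_light_ratio(heavy_impressions: float, light_impressions: float) -> float:
    """Heavier-viewer impressions delivered per lighter-viewer impression."""
    if light_impressions <= 0:
        raise ValueError("light_impressions must be positive")
    return heavy_impressions / light_impressions


# Hypothetical impression counts: (heavy viewers, light viewers).
networks = {
    "NETWORK_A": (1_360_000, 100_000),
    "NETWORK_B": (740_000, 100_000),
}

# Rank networks from most to least efficient at reaching lighter viewers
# (a lower ratio means fewer heavy impressions per light impression).
ranked = sorted(networks, key=lambda name: heavy_to_light_ratio(*networks[name]))
for name in ranked:
    print(name, round(heavy_to_light_ratio(*networks[name]), 1))
```

A planner looking for lighter viewers would start from the top of this ranking, which mirrors the "relatively more lighter viewer impressions" group on the slide.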
28. Where The Other 40% Are
To capture light viewers, media planning and measurement tools must quickly apply new methods to emerging data sets
30. When Data Goes Missing
Automation of error checking/quality control is essential
Reuse the data to solve other problems
Occasionally observe missing data
Three choices:
• Pick up the phone
• Estimate missing fields
• Work around the missing data
Time series of SYFY network. 10645 observations from 2010.02.28 at 7:00pm Eastern to 2010.10.14 at 12:30pm Eastern
Source: Simulmedia’s a7
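One way to automate the check for missing observations is to scan a time series for gaps larger than the expected sampling interval. The sketch below is illustrative only: the 30-minute interval and the sample timestamps are assumptions, since the slide does not state the sampling interval of the SYFY series.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, step):
    """Return (previous, next, missing_count) for every gap wider than `step`.

    `missing_count` is how many observations would fit inside the gap
    if the series were sampled once every `step`.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        delta = curr - prev
        if delta > step:
            gaps.append((prev, curr, delta // step - 1))
    return gaps


# Hypothetical observations with two half-hour slots missing.
observed = [
    datetime(2010, 2, 28, 19, 0),
    datetime(2010, 2, 28, 19, 30),
    datetime(2010, 2, 28, 21, 0),
]
for prev, curr, missing in find_gaps(observed, timedelta(minutes=30)):
    print(f"gap between {prev} and {curr}: {missing} missing observations")
```

Once gaps are flagged automatically, the three choices on the slide (call the provider, estimate, or work around) can be made per gap rather than discovered by accident.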
33. The Revolution of Simple Methods
More data beats better algorithms.
The best performing algorithm underperforms the worst algorithm when the worst algorithm is given an order of magnitude more data.
Simple algorithms at very large scale can help better predict audience movement.
Peter Norvig | Internet Scale Data Analysis | June 21, 2010
Original graph sourced from: Banko & Brill, 2001. Mitigating the paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing
34. Packaging Reach
Very large data sets better predict TV audience movements
Peter Norvig | Internet Scale Data Analysis | June 21, 2010
35. The Cost Of More Data
More data drives better results but there are costs
• All data online. All the time.
• Less expensive hardware
• Extremely flexible

• All data online. All the time.
• More expensive talent
• Physicists & statisticians ain’t cheap
• Hard to find programmers
• Not everything meets your needs
• Evolving technologies in mission critical functions
36. The Data Isn’t Biased Just Because It Comes From A Set Top Box
Highly Confidential
37. Applying Simple Methods At Scale
High correlation of a7 measures and Nielsen estimates.
Either bias is insignificant or Nielsen data and our data share the same bias.
Multiple methods yield similar results
Regression analysis of Nielsen Household Cume Rating against Simulmedia’s a7 cume rating. 20 Primetime Network shows with HAWAII FIVE-0. Fall 2010.
Sources: Nielsen & Simulmedia’s a7
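For readers who want to see how a fit statistic of this kind is computed, here is a minimal sketch of r² for a simple linear regression of one rating series on another. The ratings below are made-up placeholders, and this is not necessarily the exact procedure Simulmedia used; for a one-variable regression with an intercept, r² equals the squared Pearson correlation, which is what the function computes.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length series.

    For a simple linear regression with an intercept this equals the
    regression's coefficient of determination.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical cume ratings for the same four programs from two sources.
panel_ratings = [1.0, 2.0, 3.0, 4.0]
set_top_box_ratings = [1.0, 3.0, 2.0, 4.0]
print(round(r_squared(panel_ratings, set_top_box_ratings), 2))
```

Note that a high r² only shows the two measures move together; as the slide says, it cannot distinguish "no bias" from "both sources share the same bias".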
38. And Then We Kept Going
We measured program Tune-In, Spot Tune-In, Campaign Reach, Campaign Rating using multiple slices of our data set using two different sample sets and time frames
How we sliced it
• Entire a7 data set
• Cross correlated individual data sets contained in a7 aggregate data set
• Aggregate cross geographies (DMA to DMA)
Two samples
1. Sample 1: Fall 2010: 20 Primetime broadcast series launches + promos
2. Sample 2: Jan 2011: 15 Primetime cable series premieres + promos (Plus one multi-season/year primetime broadcast premiere + promos)
• Hand selected programs
• Mix of genres
• Mix of new vs. returning shows
Observations
• Sample 1 average r² > 0.85
• Sample 2 average r² > 0.93
40. Closing The Loop On Program Promotion
Spring 2010 broadcast premiere promotion. Horizontal: Left to right moves back in time. 0 is the premiere time. Vertical: Conversion rate is measured in percent. Size of the bubble represents total conversions for a given spot.
Sources: Simulmedia’s a7
41. Closing The Loop On Program Promotion
Spring 2010 broadcast premiere promotion. Horizontal: Left to right moves back in time. 0 is the premiere time. Vertical: Conversion rate is measured in percent. Size of the bubble represents total conversions for a given spot.
Sources: Simulmedia’s a7
42. Closing The Loop
Long held beliefs and rules of thumb in planning may or may not be supported by data
TV marketers now have more options for show promotion
44. Time Series: Broadcast: CBS
60 networks. High correlation between Nielsen large sample measurement and a7 measures
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
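Z scores let two series measured on different scales be overlaid on one chart, which is presumably why the plots use them. The sketch below standardizes two hypothetical hourly series; the values are placeholders, and the choice of population standard deviation (`pstdev`) rather than sample standard deviation is our assumption. The imputation step mentioned in the caption is not shown.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a series to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(v - mu) / sigma for v in values]


# Hypothetical hourly audience figures on very different scales.
panel_estimates = [1.0, 2.0, 3.0]         # e.g. rating points
set_top_box_counts = [1000, 2000, 3000]   # e.g. tuned boxes

# After standardizing, series with the same shape line up point for point.
for a, b in zip(z_scores(panel_estimates), z_scores(set_top_box_counts)):
    print(round(a, 3), round(b, 3))
```

Because the scale is removed, any remaining gap between the two lines on such a plot reflects a difference in shape, not in units.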
45. Time Series: Broadcast: Fox
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
46. Time Series: Broadcast: ABC
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
47. Time Series: Cable: Investigation Discovery
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
48. Time Series: Cable: Golf
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
49. Time Series: Cable: Bravo
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
50. Time Series: Cable: ESPN2
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
51. Time Series: Cable: Speed
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
53. When You Look Closer
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
Sources: Nielsen & Simulmedia’s a7
54. High Frequency Time Series: ABC Family
Volatility in dayparts, low rated networks, demographics…. Unrated networks “don’t exist.” Did NOT look at local.
[Chart panels: a7, Nielsen]
Sample graph from High Frequency (Second and Minute level) Time Series Analysis of 45 networks on January 19th 2011. Simulmedia a7 Sample (Second by Second to Minute). Nielsen Sample (Minute by Minute).
Sources: Nielsen & Simulmedia’s a7
56. Gender Driven Geographic Variation
Viewing by zip code among women across markets is more varied than among men in the same zip codes
[Chart panels: Women 18-54, Men 18-54]
Fraction of view time for ages 18-54 as fraction of view time for all TV viewers. Week 2 vs. the same fraction for week 1 (last two weeks in January). Three markets: Philadelphia (blue), Atlanta (red) and Chicago (green). Each point represents a zip code in one of these markets.
Source: Simulmedia’s a7
57. Gender Driven Geographic Variation
Planning tactics for female targeted campaigns should be different than male target campaigns
PS…Also a good case for geo based creative versioning
59. Privacy By Design
• All marketing data companies need to care
• Make consumer privacy protection part of the business from the beginning
• Anonymous, aggregated data only
• No personal data or data that can be related to particular individuals or devices
• Broad marketing segmentations, not profiling
• No sensitive data
Don’t be creepy
61. Fragmentation Effects On Frequency
Each segment was above 70% reach but the frequency distribution was nearly identical
Percent of audience reached for major animated motion picture campaign 2011. Two weeks prior to release. Each stacked bar is a different audience segment. Each color within the stacked bar represents the frequency of ad view for each segment.
Source: Nielsen & Simulmedia’s a7
62. Fragmentation Effects On Frequency
Fragmentation is affecting all high reach campaigns.
Percent of audience reached for insurance advertisers September to October 2010. Approximately 8000 ads. Each stacked bar is a different audience segment. Each color within the stacked bar represents the frequency of ad view for each segment.
Source: Nielsen & Simulmedia’s a7
63. Fragmentation Effects On Frequency
The TV advertising market can’t continue to support this
64. 40% Of The Audience Is Getting 85% Of The Impressions
Highly Confidential
65. Fragmentation Rears Its Head Again
Campaign impressions increasingly concentrated against heavy viewers.
Total US Television Audience
Average Frequency Per Quintile | % of Total Impressions Per Quintile
0.0 | 0.0%
1.4 | 3.6%
4.3 | 10.8%
9.1 | 23.0%
24.8 | 62.6%
Percent of audience reached for a different major animated motion picture campaign 2011. Two weeks prior to release. The stacked bar represents quintiles. Blue labels are average frequency per respective quintile. Red labels are % of total campaign impressions by respective quintile.
Source: Nielsen & Simulmedia’s a7
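The two columns of labels on this slide are linked by simple arithmetic: if every quintile holds the same number of viewers, each quintile's share of impressions is its average frequency divided by the sum of all five averages. The sketch below applies that to the slide's frequencies. The equal-size assumption and the function name are ours, and the results match the slide's percentages only to within rounding.

```python
def impression_shares(avg_frequency_per_group):
    """Share of total impressions per group, assuming equal-size groups."""
    total = sum(avg_frequency_per_group)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    return [f / total for f in avg_frequency_per_group]


# Average frequency per quintile, lightest to heaviest viewers (from the slide).
avg_frequency = [0.0, 1.4, 4.3, 9.1, 24.8]
shares = impression_shares(avg_frequency)

for freq, share in zip(avg_frequency, shares):
    print(f"avg frequency {freq:>4}: {share:.1%} of impressions")

# The two heaviest quintiles are 40% of the audience.
print(f"top two quintiles: {shares[3] + shares[4]:.1%} of impressions")
```

Under these assumptions the heaviest two quintiles together receive roughly 85% of impressions, consistent with the "40% of the audience is getting 85% of the impressions" headline earlier in the deck.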
68. Choices
• If fragmentation is causing declining campaign reach and frequency imbalances, marketers must make choices.
• Reduce reach
• Do nothing
• Use other channels
• Stabilize or improve reach
• Re-aggregate audiences using big data
What do you think?
69. Jack Smith
jack@simulmedia.com
@simulmedia
@jkellonsmith
70. About Our Science Team
• Krishna Balasubramanian, Chief Scientist
• Previously: Chief Scientist, Tacoda. Chief Scientist, Real Media.
• Doctoral Candidate, Physics. (Condensed Matter Physics) The Ohio State University
• MS, Computer & Information Systems. The Ohio State University
• MSc, Physics. Indian Institute of Technology, Kanpur
• Yuliya Torosjan, Scientist
• Previously: Clinical Research (Brain Imaging), Mount Sinai College of Medicine
• MA, Statistics. Columbia University
• BSE, Computer Science & Engineering. University of Pennsylvania
• BA, Psychology. University of Pennsylvania
• Mario Morales, Scientist
• Previously: Lecturer, Bioinformatics, New York University. Senior Consultant, Weiser LLP.
• MS, Statistics. Hunter College
• MS, Bioinformatics. New York University
• Dr. Sidd Mukherjee, Scientist
• Previously, Visiting Scholar (Atomic Scattering experiments), The Ohio State University
• Post doctoral research, Heat capacity of Helium-4. Pennsylvania State University
• PhD, Physics. (Thesis: Measurements of Diffuse and Specular Scattering of 4He Atoms from 4He Films), Ohio State University
• MS, Computer & Information Systems. The Ohio State University
• BSc, Physics & Mathematics. University of Bombay