4. Snowplow is an open-source web and event analytics platform, first version released in early 2012
• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008
• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy
• We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow
• We started working full time on Snowplow in summer 2013
5. We wanted to take a fresh approach to web analytics
• Your own web event data -> in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data
Diagram labels: Data pipeline; Data warehouse; Analyse your data in any analysis tool
6. By spring 2013 we had arrived at a relatively stable batch-based processing architecture
Diagram: Snowplow Hadoop data pipeline
• Website / webapp with JavaScript event tracker
• CloudFront-based event collector or Clojure-based event collector
• Amazon S3
• Scalding-based enrichment on Hadoop
• Amazon Redshift / PostgreSQL
8. Snowplow is evolving from a web analytics platform into a general event analytics platform
Diagram labels: Collect event data from any connected device; Data warehouse
9. Web analysts work with a small number of event types – outside of web, the number of possible event types is… infinite
Web events
• Page view
• Order
• Add to basket
• Page activity
All events
• Game saved
• Machine broke
• Car started
• Spellcheck run
• Screenshot taken
• Fridge empty
• App crashed
• Disk full
• SMS sent
• Screen viewed
• Tweet drafted
• Player died
• Taxi arrived
• Phonecall ended
• Cluster started
• Till opened
• Product returned
• ∞
10. There are two historic approaches to dealing with the explosion of possible event types
• Web analytics vendors: Custom Variables
• Mobile and app analytics vendors: Schema-less JSONs
11. Custom variables are very restrictive
1. Take a standard web event, like a page view:
Page View
2. and add custom variables until it becomes something totally different:
Page View + vehicle=taxi23 + status=arrived
= a “taxi arrived” event, kind of!
12. Schema-less JSONs are better, but they have a different set of problems
Issues with the event name:
• Separate from the event properties
• Not versioned
• Not unique – HBO video played versus Brightcove video played
Lots of unanswered questions about the properties:
• Is length required, and is it always a number?
• Is id required, and is it always a string?
• What other optional properties are allowed for a video play?
Other issues:
• What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields
• Why does the analyst need to keep an implicit schema in their head to analyze video played events?
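A small sketch of the “len” versus “length” problem, in Python. The event payloads and their values are invented for illustration; only the property names id, length and len come from the slide.

```python
from collections import defaultdict

# Hypothetical schema-less "video played" events; the values are invented.
events = [
    {"id": "v-001", "length": 213},
    {"id": "v-002", "length": 95},
    {"id": "v-003", "len": 187},  # developer typo: "len" instead of "length"
]


def collect_fields(events):
    """Group values by property name, as a schema-less store would."""
    columns = defaultdict(list)
    for event in events:
        for key, value in event.items():
            columns[key].append(value)
    return dict(columns)


columns = collect_fields(events)
print(sorted(columns))    # property names seen across all events
print(columns["length"])  # only two values; the third landed under "len"
```

Nothing rejects the misspelled property, so the analyst has to know to look in both fields.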
14. When a developer or analyst defines a new event in JSON, let’s ask them to create a JSON Schema for that event
Annotations on the example schema:
• Additional optional field we might not know about otherwise
• No other fields allowed
• Yes, length should always be a number
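As a rough illustration of what such a schema could look like and how it answers the earlier questions, here is a sketch in Python. The schema is hypothetical: id and length come from the slides, while the optional quality field is invented. The validate function is a hand-rolled check covering only three JSON Schema keywords, not a full JSON Schema validator.

```python
# Hypothetical JSON Schema for a "video played" event, written as a Python dict.
# "id" and "length" come from the slides; "quality" is an invented optional field.
VIDEO_PLAYED_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "length": {"type": "number"},
        "quality": {"type": "string"},
    },
    "required": ["id", "length"],
    "additionalProperties": False,
}

# Maps the JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "number": (int, float)}


def validate(event, schema):
    """Return a list of problems; an empty list means the event passed.

    Handles only "required", per-property "type" and
    "additionalProperties": false, not the full JSON Schema specification.
    """
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in event:
            errors.append(f"missing required property: {name}")
    for name, value in event.items():
        if name not in properties:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected property: {name}")
            continue
        expected = JSON_TYPES[properties[name]["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for property: {name}")
    return errors


good = validate({"id": "v-001", "length": 213}, VIDEO_PLAYED_SCHEMA)
typo = validate({"id": "v-003", "len": 187}, VIDEO_PLAYED_SCHEMA)
print(good)
print(typo)
```

With the schema in place, the “len” typo is caught at validation time instead of silently creating a second field.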
15. But we need to let our event definitions evolve, so let’s add versioning – we’re calling this SchemaVer
MODEL-REVISION-ADDITION
• Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc
• Try to stick to backwards-compatible ADDITION upgrades as much as possible
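A minimal sketch of parsing and bumping a MODEL-REVISION-ADDITION version string, in Python. The SchemaVer class and its bump method are illustrative names, not part of any Snowplow library. Resetting the components to the right of the one being incremented is an assumption, chosen because it reproduces the 1-0-2 to 1-1-0 step in the sequence above.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    """Illustrative MODEL-REVISION-ADDITION version (hypothetical helper)."""

    model: int
    revision: int
    addition: int

    @classmethod
    def parse(cls, text):
        """Parse a string such as "1-0-2" into its three components."""
        parts = text.split("-")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            raise ValueError(f"not a SchemaVer: {text!r}")
        return cls(*(int(p) for p in parts))

    def bump(self, part):
        """Increment one component and reset the components to its right."""
        if part == "model":
            return SchemaVer(self.model + 1, 0, 0)
        if part == "revision":
            return SchemaVer(self.model, self.revision + 1, 0)
        if part == "addition":
            return SchemaVer(self.model, self.revision, self.addition + 1)
        raise ValueError(f"unknown part: {part!r}")

    def __str__(self):
        return f"{self.model}-{self.revision}-{self.addition}"


version = SchemaVer.parse("1-0-0")
history = [str(version)]
for part in ("addition", "addition", "revision"):
    version = version.bump(part)
    history.append(str(version))
print(history)
```

Because the class is a tuple of integers, versions also compare in the natural order, for example SchemaVer.parse("1-0-2") < SchemaVer.parse("1-1-0").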
16. Where are our schemas going to live? We need a schema repository/registry
Diagram: Schema repo {} at the centre, used to:
1. Test instrumentation
2. Validate events
3. Define structure
4. Drive shredding
5. Define structure
Pipeline: Raw events in JSON format -> Enrichment Manager -> Enriched events in Thrift or Avro format -> Shredder -> Enriched events in TSV ready for loading into db
17. We need to namespace our schemas properly to prevent clashes and confusion in our schema repository
We are calling our schema methodology “Iglu”
iglu:com.channel2.vod/video_played/jsonschema/1-0-0
• com.channel2.vod: the vendor of this event
• video_played: event name
• jsonschema: schema format
• 1-0-0: schema version
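The four parts of the example URI can be pulled apart with a few lines of Python. This is a sketch only: SchemaKey and parse_iglu_uri are hypothetical names, and the regular expression assumes the shape iglu:vendor/name/format/version shown on the slide, with the version in MODEL-REVISION-ADDITION form.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    """The four namespacing parts of a schema URI (hypothetical helper)."""

    vendor: str
    name: str
    format: str
    version: str


# Assumed shape: iglu:<vendor>/<name>/<format>/<version>
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/"
    r"(?P<format>[^/]+)/(?P<version>\d+-\d+-\d+)$"
)


def parse_iglu_uri(uri):
    """Split a schema URI into vendor, name, format and version."""
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised schema URI: {uri!r}")
    return SchemaKey(**match.groupdict())


key = parse_iglu_uri("iglu:com.channel2.vod/video_played/jsonschema/1-0-0")
print(key)
```

Keeping the vendor in the key is what separates one company's video_played event from another's.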
18. Bringing it all together, let’s now make the event JSONs self-describing, with a schema header and data body
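The slide's example is not reproduced in this transcript, so the following Python sketch shows one way a self-describing event could be laid out. The key names "schema" and "data" are assumed here for the header and body, and the payload values are invented.

```python
import json


def make_self_describing(schema_uri, data):
    """Wrap an event body with a header naming the schema it conforms to."""
    return {"schema": schema_uri, "data": data}


def split_self_describing(payload):
    """Return (schema_uri, data), failing loudly if either part is missing."""
    missing = [key for key in ("schema", "data") if key not in payload]
    if missing:
        raise ValueError(f"not self-describing, missing: {missing}")
    return payload["schema"], payload["data"]


event = make_self_describing(
    "iglu:com.channel2.vod/video_played/jsonschema/1-0-0",
    {"id": "v-001", "length": 213},  # invented example values
)
wire = json.dumps(event)  # what would be sent by a tracker
schema_uri, data = split_self_describing(json.loads(wire))
print(schema_uri)
print(data)
```

The event name and version now travel with the properties, which addresses the "separate from the event properties" and "not versioned" issues listed earlier.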
19. And for good measure, let’s add in our schema information into the JSON Schema itself
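The slide's example schema is likewise missing from this transcript. As a sketch, the schema information could sit in a dedicated block inside the JSON Schema. The "self" key name below is an assumption, the optional properties are omitted, and schema_uri is a hypothetical helper that rebuilds the iglu: URI from that block.

```python
# Hypothetical self-describing JSON Schema, written as a Python dict.
# The "self" block carries the same four parts as the schema URI.
VIDEO_PLAYED_SCHEMA = {
    "self": {
        "vendor": "com.channel2.vod",
        "name": "video_played",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "length": {"type": "number"},
    },
    "required": ["id", "length"],
    "additionalProperties": False,
}


def schema_uri(schema):
    """Rebuild the iglu: URI from the schema's own "self" block."""
    info = schema["self"]
    return "iglu:{vendor}/{name}/{format}/{version}".format(**info)


uri = schema_uri(VIDEO_PLAYED_SCHEMA)
print(uri)
```

A schema file that names itself can be moved between repositories without losing track of which events it describes.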
22. We are also starting to define third-party events for Snowplow integration, starting with Zendesk customer support events
23. Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To chat – @alexcrdean on Twitter or alex@snowplowanalytics.com