2. Our businesses are constantly evolving…
• Our digital products (apps and platforms) are constantly developing
• The questions we ask of our data are constantly changing
• It is critical that the analytics stack can evolve with your business
3. Self-describing data + event data modeling: an event data pipeline that evolves with your business
How Snowplow users evolve their analytics stacks with their business
6. As a Snowplow user, you can define your own events and entities
Events:
• Build castle, Form alliance, Declare war
• View product, Buy product, Deliver product
Entities (contexts):
• Player, Game, Level, Currency
• Product, Customer, Basket, Delivery van
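To make the split concrete, here is a minimal sketch (not taken from the deck) of how one of the e-commerce examples above could be expressed in Python as a self-describing event with entity contexts attached; the com.acme vendor, schema versions and field names are all illustrative assumptions.

# Sketch: a "buy product" event with "product" and "customer" entities attached
# as contexts. Every schema URI and field below is hypothetical.
buy_product_event = {
    "schema": "iglu:com.acme/buy_product/jsonschema/1-0-0",
    "data": {"order_id": "o-1234", "total": 42.50, "currency": "GBP"},
}

contexts = [
    {   # entity describing the product involved in the event
        "schema": "iglu:com.acme/product/jsonschema/1-0-0",
        "data": {"sku": "P-001", "name": "Castle kit", "price": 42.50},
    },
    {   # entity describing the customer performing the event
        "schema": "iglu:com.acme/customer/jsonschema/1-0-0",
        "data": {"customer_id": "c-789", "segment": "returning"},
    },
]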
8. Then send data into Snowplow as self-describing JSONs
(Pipeline steps: 1. Validation → 2. Dimension widening → 3. Data modeling)
Event (with schema reference):
{
  "schema": "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
  "data": {
    "timestamp": "2016-11-16 19:53:21",
    "location": "Berlin",
    "temperature": 3,
    "units": "Centigrade"
  }
}
Schema:
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a temperature measurement event",
  "self": {
    "vendor": "com.israel365",
    "name": "temperature_measure",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "timestamp": { "type": "string" },
    "location": { "type": "string" },
    …
  },
  …
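As a hedged illustration of sending such an event in (this is not code from the deck), the sketch below uses the snowplow-tracker Python package; the collector endpoint is made up, and constructor signatures differ between tracker releases, so treat this as a sketch under those assumptions rather than a definitive implementation.

# Sketch, assuming an older snowplow-tracker release where Tracker(emitter) is
# accepted; newer releases take a namespace plus a list of emitters instead.
from snowplow_tracker import Emitter, Tracker, SelfDescribingJson

emitter = Emitter("collector.example.com")   # hypothetical collector endpoint
tracker = Tracker(emitter)

# Track the temperature event from the previous slide as a self-describing event
tracker.track_self_describing_event(SelfDescribingJson(
    "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
    {
        "timestamp": "2016-11-16 19:53:21",
        "location": "Berlin",
        "temperature": 3,
        "units": "Centigrade",
    },
))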
9. The schemas can then be used in a number of ways
• Validate the data (important for data quality)
• Load the data into tidy tables in your data warehouse
• Make it easy / safe to write downstream data processing applications (e.g. for real-time users)
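To illustrate the validation point, the sketch below checks the event payload from the previous slides against its schema using the generic jsonschema Python package rather than Snowplow's own validator; the number type and the required list are assumptions, since those parts of the schema were elided on the slide.

# Sketch: validating a self-describing event's "data" payload with the generic
# jsonschema package (pip install jsonschema). Mirrors what the pipeline's
# validation step does, but is not Snowplow's own code.
from jsonschema import validate, ValidationError

temperature_schema = {
    "type": "object",
    "properties": {
        "timestamp":   {"type": "string"},
        "location":    {"type": "string"},
        "temperature": {"type": "number"},   # assumed; elided on the slide
        "units":       {"type": "string"},   # assumed; elided on the slide
    },
    "required": ["timestamp", "location", "temperature"],  # assumption
}

event = {
    "schema": "iglu:com.israel365/temperature_measure/jsonschema/1-0-0",
    "data": {"timestamp": "2016-11-16 19:53:21", "location": "Berlin",
             "temperature": 3, "units": "Centigrade"},
}

try:
    validate(instance=event["data"], schema=temperature_schema)
    print("event is valid")
except ValidationError as err:
    print("bad event, send to quarantine:", err.message)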
11. What is event data modeling?
(Pipeline steps: 1. Validation → 2. Dimension widening → 3. Data modeling)
Event data modeling is the process of using business logic to aggregate over
event-level data to produce 'modeled' data that is simpler for querying.
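As a toy illustration of that definition (not from the deck), the sketch below aggregates raw page-view events into a per-user summary, the kind of 'modeled' table that is simpler to query; the field names are illustrative.

# Toy sketch of event data modeling: business logic aggregating event-level rows
# into a simpler per-user "modeled" table.
from collections import defaultdict

events = [
    {"user_id": "u1", "event": "page_view", "timestamp": "2016-11-16 19:53:21"},
    {"user_id": "u1", "event": "page_view", "timestamp": "2016-11-16 19:55:02"},
    {"user_id": "u2", "event": "page_view", "timestamp": "2016-11-16 20:01:13"},
]

modeled = defaultdict(lambda: {"page_views": 0, "last_seen": ""})
for e in events:                          # aggregate over the event-level data...
    row = modeled[e["user_id"]]
    row["page_views"] += 1
    row["last_seen"] = max(row["last_seen"], e["timestamp"])

print(dict(modeled))                      # ...to produce data that is simpler to query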
13. In general, event data modeling is performed on the complete event stream
• Late-arriving events can change the way you understand earlier-arriving events
• If we change our data models, working on the complete stream gives us the flexibility to recompute historical data based on the new model
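As a toy example of why that flexibility matters (the timestamps and timeouts below are illustrative assumptions, not from the deck), changing one piece of modeling logic, here the session timeout, and re-running it over the complete event stream regroups historical events:

# Toy sketch: recomputing sessionization over the full event stream with new logic.
from datetime import datetime

def sessionize(timestamps, timeout_minutes):
    # Group one user's event timestamps into sessions split by inactivity gaps.
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and (ts - current[-1]).total_seconds() > timeout_minutes * 60:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

events = [datetime(2016, 11, 16, 19, 0), datetime(2016, 11, 16, 19, 20),
          datetime(2016, 11, 16, 19, 55)]

print(len(sessionize(events, 30)))   # old model, 30-minute timeout: 2 sessions
print(len(sessionize(events, 10)))   # new model, 10-minute timeout: 3 sessions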
15. How do we handle pipeline evolution?
• PUSH FACTORS: what is being tracked will change over time
• PULL FACTORS: what questions are being asked of the data will change over time
Businesses are not static, so event pipelines should not be either
(Diagram: event sources such as web, apps, servers, comms channels, push notifications and smart car / home feed the collection and processing stages; data then flows into the data warehouse for data exploration and predictive modeling, into real-time dashboards, and into real-time, data-driven applications such as an RT bidder, vouchering and personalization.)
16. Push example: a new source of event data
• If data is self-describing, it is easy to add an additional source
• Self-describing data is good for managing bad data and pipeline evolution
"I'm an email send event, and I have information about the recipient (email address, customer ID) and the email (ID, tags, variation)"
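A hypothetical self-describing JSON for that email-send event might look like the sketch below; the vendor, schema versions and field names are assumptions based only on the description in the slide.

# Hypothetical sketch of the email-send event described above: a self-describing
# event plus recipient and email entities. None of these schemas are real.
email_send_event = {
    "schema": "iglu:com.acme/email_send/jsonschema/1-0-0",
    "data": {"sent_at": "2016-11-16 19:53:21"},
}

contexts = [
    {"schema": "iglu:com.acme/recipient/jsonschema/1-0-0",
     "data": {"email_address": "jane@example.com", "customer_id": "c-789"}},
    {"schema": "iglu:com.acme/email/jsonschema/1-0-0",
     "data": {"id": "e-42", "tags": ["winter", "promo"], "variation": "B"}},
]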
18. Answering the question: 3 possibilities
1. Existing data model supports the answer
• Possible to answer the question with existing modeled data
2. Need to update the data model
• Data collected already supports the answer
• Additional computation required in the data modeling step (additional logic)
3. Need to update the data model and data collection
• Need to extend event tracking
• Need to update data models to incorporate the additional data (and potentially additional logic)
19. Self-describing data and the ability to recompute data models are essential to enable pipeline evolution
Self-describing data enables:
• Updating existing events and entities in a backwards-compatible way, e.g. adding optional new fields
• Updating existing events and entities in a backwards-incompatible way, e.g. changing field types, removing fields, adding compulsory fields
• Adding new event and entity types
Recomputing data models on the entire data set enables:
• Adding new columns to existing derived tables, e.g. a new audience segmentation
• Changing the way existing derived tables are generated, e.g. changing sessionization logic
• Creating new derived tables
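To make the backwards-compatible versus backwards-incompatible distinction concrete, the sketch below applies Iglu's SchemaVer convention (MODEL-REVISION-ADDITION) to the temperature_measure schema from earlier; the humidity field, the required list and the exact version bumps are illustrative assumptions.

# Sketch of SchemaVer applied to the earlier temperature_measure schema.
schema_1_0_0 = {
    "self": {"vendor": "com.israel365", "name": "temperature_measure",
             "format": "jsonschema", "version": "1-0-0"},
    "type": "object",
    "properties": {
        "timestamp":   {"type": "string"},
        "location":    {"type": "string"},
        "temperature": {"type": "number"},
        "units":       {"type": "string"},
    },
    "required": ["timestamp", "location", "temperature"],
}

# Backwards-compatible change: add an optional field, so bump ADDITION (1-0-0 -> 1-0-1)
schema_1_0_1 = {
    **schema_1_0_0,
    "self": {**schema_1_0_0["self"], "version": "1-0-1"},
    "properties": {**schema_1_0_0["properties"], "humidity": {"type": "number"}},
}

# Backwards-incompatible change: change a field type and make "units" compulsory,
# so bump MODEL (-> 2-0-0); historical events may no longer validate
schema_2_0_0 = {
    **schema_1_0_0,
    "self": {**schema_1_0_0["self"], "version": "2-0-0"},
    "properties": {**schema_1_0_0["properties"], "temperature": {"type": "integer"}},
    "required": ["timestamp", "location", "temperature", "units"],
}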