Introducing Snowplow 
Big Data Beers, Berlin 
Huge thanks to Zalando for hosting!
Snowplow is an open-source web and event analytics platform, 
first version released in early 2012 
• Co-founders Alex Dean and Yali Sassoon met at 
OpenX, the open-source ad technology business 
in 2008 
• After leaving OpenX, Alex and Yali set up Keplar, 
a niche digital product and analytics consultancy 
• We released Snowplow as a skunkworks 
prototype at the start of 2012: 
github.com/snowplow/snowplow 
• We started working full time on Snowplow in 
summer 2013
At Keplar, we grew frustrated by significant limitations in 
traditional web analytics programs 
Data collection 
• Sample-based (e.g. Google Analytics) 
• Limited set of events e.g. page views, goals, transactions 
• Limited set of ways of describing events (custom dim 1, custom dim 2…) 
Data processing 
• Data is processed ‘once’ 
• No validation 
• No opportunity to reprocess e.g. following update to business rules 
• Data is aggregated prematurely 
Data access 
• Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics) 
• Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst) 
• Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst) 
• As a result, data is siloed: hard to join with other data sets
And we saw the potential of new “big data” technologies and 
services to solve these problems in a scalable, low-cost manner 
CloudFront Amazon S3 Amazon EMR Amazon Redshift 
These tools make it possible to capture, transform, store and analyse all your 
granular, event-level data, so you can perform any analysis
We wanted to take a fresh approach to web analytics 
• Your own web event data -> in your own data warehouse 
• Your own event data model 
• Slice / dice and mine the data in highly bespoke ways to answer your 
specific business questions 
• Plug in the broadest possible set of analysis tools to drive value from your 
data 
Data pipeline -> Data warehouse -> Analyse your data in any analysis tool
Early on, we made a crucial decision: Snowplow should be 
composed of a set of loosely coupled subsystems 
1. Trackers: generate event data from any environment. Launched with: JavaScript tracker 
2. Collectors: log raw events from trackers. Launched with: CloudFront collector 
3. Enrich: validate and enrich raw events. Launched with: HiveQL + Java UDF-based enrichment 
4. Storage: store enriched events ready for analysis. Launched with: Amazon S3 
5. Analytics: analyze enriched events. Launched with: HiveQL recipes 
A, B, C, D (between the subsystems) = Standardised data protocols 
These turned out to be critical to allowing us 
to evolve the above stack
Our initial skunkworks version of Snowplow – it was basic but it 
worked, and we started getting traction 
Snowplow data pipeline v1 (spring 2012): 
Website / webapp -> JavaScript event tracker -> CloudFront-based pixel collector -> HiveQL + Java UDF “ETL” -> Amazon S3
What did people start using it for? 
Warehousing their web event data 
To enable… 
• Agile aka ad hoc analytics 
• Marketing attribution modelling 
• Customer lifetime value calculations 
• Customer churn detection 
• RTB fraud 
• Product recommendations
Current Snowplow design 
and architecture
Our protocol-first, loosely-coupled approach made it possible to 
start swapping out existing components… 
Snowplow data pipeline v2 (spring 2013): 
Website / webapp -> JavaScript event tracker -> CloudFront-based event collector or Clojure-based event collector -> Scalding-based enrichment (in place of the HiveQL + Java UDF “ETL”) -> Amazon S3 -> Amazon Redshift / PostgreSQL
• Clojure-based event collector: allows Snowplow users to set a third-party cookie with a user ID. Important for ad networks, widget companies, multi-domain retailers 
• Amazon Redshift / PostgreSQL: because Snowplow users wanted a much faster query loop than HiveQL/MapReduce 
• Scalding-based enrichment: we wanted a robust, feature-rich framework for managing validations, enrichments etc
So far we have open-sourced a number of different trackers – 
with more planned 
Production ready: 
• JavaScript 
• No-JavaScript (image beacon) 
• Python 
• Lua 
• Arduino 
Beta: 
• Ruby 
• iOS 
• Android 
• Node.js 
In development: 
• .NET 
• PHP
Enrichment process: what is Scalding? 
• Scalding is a Scala API over Cascading, the Java framework for building 
data processing pipelines on Hadoop. The stack, from top to bottom: 
• Scalding, Cascalog, PyCascading, cascading.jruby 
• Cascading, Hive, Pig (Java) 
• Hadoop MapReduce 
• Hadoop DFS
Our “enrichment process” (formerly known as ETL) actually does 
two things: validation and enrichment 
• Our validation model looks like this: 
Raw events -> Enrichment Manager -> “Good” enriched events, or “Bad” raw events + reasons why they are bad 
• Under the covers, we use a lot of monadic Scala (Scalaz) code
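The validation model above can be sketched in a few lines. This is a minimal illustration in Python, not the Scala/Scalaz code the pipeline actually uses; the field names (`event_type`, `timestamp`) and the functions `validate`, `enrich` and `process` are hypothetical. The only point it makes is that every raw event ends up either as a "good" enriched event or as a "bad" row that keeps the raw event together with all the reasons it failed.

```python
from typing import Any, Dict, List, Tuple

RawEvent = Dict[str, Any]


def validate(raw: RawEvent) -> List[str]:
    """Collect every validation failure instead of stopping at the first one."""
    errors = []
    if not raw.get("event_type"):
        errors.append("missing event_type")
    ts = raw.get("timestamp")
    if not isinstance(ts, (int, float)) or isinstance(ts, bool) or ts <= 0:
        errors.append("timestamp must be a positive number")
    return errors


def enrich(raw: RawEvent) -> RawEvent:
    """Stand-in for the real enrichments; here it only marks the event."""
    enriched = dict(raw)
    enriched["validated"] = True
    return enriched


def process(raw_events: List[RawEvent]) -> Tuple[List[RawEvent], List[dict]]:
    """Split raw events into good enriched events and bad rows with reasons."""
    good, bad = [], []
    for raw in raw_events:
        errors = validate(raw)
        if errors:
            bad.append({"raw": raw, "errors": errors})
        else:
            good.append(enrich(raw))
    return good, bad


good, bad = process([
    {"event_type": "page_view", "timestamp": 1400000000},
    {"timestamp": "yesterday"},  # fails both checks
])
```

Keeping the bad rows, rather than dropping them, is what makes reprocessing possible once a tracker bug or a business rule has been fixed.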
Adding the enrichments that web analysts expect = very 
important to Snowplow uptake 
• Web analysts are used to a very specific set of enrichments from Google 
Analytics, SiteCatalyst etc 
• These enrichments have evolved over the past 15-20 years and are very domain 
specific: 
• Page querystring -> marketing campaign information (utm_ fields) 
• Referer data -> search engine name, country, keywords 
• IP address -> geographical location 
• Useragent -> browser, OS, computer information
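As an illustration of the first enrichment in the list (page querystring -> marketing campaign information), here is a small Python sketch using only the standard library. The function name `campaign_fields`, the example URL and the shape of the output are assumptions made for this example; they are not Snowplow's actual enrichment code or field names.

```python
from urllib.parse import parse_qs, urlparse

# The five conventional utm_ parameters used for campaign tagging.
UTM_FIELDS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")


def campaign_fields(page_url: str) -> dict:
    """Pull the utm_ parameters out of a page URL's querystring."""
    query = parse_qs(urlparse(page_url).query)
    # parse_qs returns a list per key; keep the first value of each utm_ field.
    return {field: query[field][0] for field in UTM_FIELDS if field in query}


fields = campaign_fields(
    "http://example.com/shop?utm_source=newsletter&utm_medium=email"
    "&utm_campaign=spring_sale&ref=abc"
)
```

Parameters that are not utm_ fields (here `ref`) are ignored, and a URL with no campaign tagging yields an empty dictionary.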
Ongoing evolution of 
Snowplow
There are three big aspects to Snowplow’s 2014 roadmap 
1. Make Snowplow work for non-web (e.g. mobile, IoT) environments as well as 
the web – RELEASED 
2. Make Snowplow work with users’ JSON events as well as with our pre-defined 
events (e.g. page views, ecommerce transactions etc) – RELEASED 
3. Move Snowplow away from an S3-based data pipeline to a unified log 
(Kinesis/Kafka)-based data pipeline – ONGOING 
Snowplow is developing into an event analytics platform (not 
just a web analytics platform) 
Data warehouse 
Collect event data 
from any connected 
device
Web analysts work with a small number of event types – outside 
of web, the number of possible event types is… infinite 
Web events 
• Page view • Page activity • Order • Add to basket 
All events 
• Game saved • Car started • Machine broke 
• Spellcheck run • Fridge empty • Screenshot taken 
• App crashed • SMS sent • Disk full 
• Screen viewed • Player died • Tweet drafted 
• Till opened • Product returned 
• Taxi arrived • Cluster started • Phonecall ended 
• … ∞
As we got further away from the web, we needed to start 
supporting users’ own JSON events 
• Specifically, events represented as JSONs with arbitrary name: value pairs 
(arbitrary to Snowplow, not to the company using Snowplow!)
Supporting both a fixed set of web events and arbitrary JSON events is a 
difficult problem 
• Almost everybody in event analytics falls on one or other side of this divide: 
• Fixed set of web events (page views etc) + custom variables 
• Send anything JSONs
We wanted to bridge that divide, making it so that 
Snowplow comes with structured events “out of the box”, 
but is extensible with unstructured events 
• Fixed set of web events (page views etc) + custom variables 
• Send anything JSONs
Mixpanel et al cause “schema loss” 
Issues with the event name: 
• Separate from the event properties 
• Not versioned 
• Not unique – HBO video played versus Brightcove video played 
Lots of unanswered questions about the properties: 
• Is length required, and is it always a number? 
• Is id required, and is it always a string? 
• What other optional properties are allowed for a video play? 
Other issues: 
• What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields 
• Why does the analyst need to keep an implicit schema in their head to analyze video played events?
We decided to use JSON Schema, with additional metadata 
about what the schema represents
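To make this concrete, here is a hedged Python sketch of what such a schema could look like for a video played event. The `self` block (vendor, name, format, version) is an assumption about the shape of the "additional metadata", and the property rules are assumptions chosen to answer the questions about `length` and `id`. The `check` function is a toy written for this example: it handles only required properties, two types and unexpected properties, and is not a JSON Schema validator.

```python
VIDEO_PLAYED_SCHEMA = {
    # Assumed metadata block describing what the schema represents.
    "self": {
        "vendor": "com.channel2.vod",
        "name": "video_played",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {
        "length": {"type": "number"},
        "id": {"type": "string"},
    },
    "required": ["length", "id"],
    "additionalProperties": False,
}

_PY_TYPES = {"number": (int, float), "string": (str,)}


def check(data: dict, schema: dict) -> list:
    """Toy check: required properties, basic types, no unexpected properties."""
    problems = []
    for name in schema["required"]:
        if name not in data:
            problems.append(f"missing required property: {name}")
    for name, value in data.items():
        rule = schema["properties"].get(name)
        if rule is None:
            if schema.get("additionalProperties") is False:
                problems.append(f"unexpected property: {name}")
        elif isinstance(value, bool) or not isinstance(value, _PY_TYPES[rule["type"]]):
            problems.append(f"{name} should be a {rule['type']}")
    return problems


ok = check({"length": 213, "id": "hY7gQrO"}, VIDEO_PLAYED_SCHEMA)
typo = check({"len": 213, "id": "hY7gQrO"}, VIDEO_PLAYED_SCHEMA)
```

With an explicit schema, the "len" instead of "length" mistake is reported when the event is checked, rather than silently splitting the data across two fields.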
From a tracker, you send in a JSON which is self-describing, with 
a schema header and data body
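A self-describing JSON of this kind might look like the sketch below. The key names `schema` and `data` are assumed from the "schema header and data body" wording, the property values are made up, and `self_describing` is a hypothetical helper; the schema URI is the one used as the example in this deck.

```python
import json


def self_describing(schema_uri: str, data: dict) -> dict:
    """Wrap an event body with a header naming the schema that describes it."""
    return {"schema": schema_uri, "data": data}


event = self_describing(
    "iglu:com.channel2.vod/video_played/jsonschema/1-0-0",
    {"length": 213, "id": "hY7gQrO"},
)

# What a tracker would serialise and send.
payload = json.dumps(event, sort_keys=True)
```

Because the schema reference travels with every event, downstream code never has to guess which version of which event it is looking at.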
Anatomy of an Iglu schema URI 
iglu:com.channel2.vod/video_played/jsonschema/1-0-0 
• iglu: we are calling our schema repository technology Iglu 
• com.channel2.vod: the vendor of this event 
• video_played: event name 
• jsonschema: schema format 
• 1-0-0: schema version
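The parts of the URI can be pulled apart mechanically. The following Python sketch assumes the URI always has exactly the four slash-separated parts shown above after the `iglu:` prefix; `SchemaKey` and `parse_iglu_uri` are names invented for this example, not part of any Iglu library.

```python
from typing import NamedTuple


class SchemaKey(NamedTuple):
    vendor: str
    name: str
    format: str
    version: str


def parse_iglu_uri(uri: str) -> SchemaKey:
    """Split iglu:vendor/name/format/version into its four parts."""
    prefix = "iglu:"
    if not uri.startswith(prefix):
        raise ValueError(f"not an Iglu URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected vendor/name/format/version in {uri!r}")
    return SchemaKey(*parts)


key = parse_iglu_uri("iglu:com.channel2.vod/video_played/jsonschema/1-0-0")
```

Here `key.vendor` is `com.channel2.vod` and `key.version` is `1-0-0`; a URI with a missing part raises `ValueError` instead of being silently accepted.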
To add this to Snowplow, we developed a new schema 
repository called Iglu, and a shredding step in Hadoop
JSON Schema just gives us a data structure for events – we are 
also evolving a grammar to capture the semantics of events 
• Subject 
• Verb 
• Direct Object 
• Indirect Object 
• Prepositional Object 
• Event Context
In parallel, we plan to evolve Snowplow from an event analytics 
platform into a “digital nervous system” for data-driven 
companies 
• The event data fed into Snowplow is written into a “Unified Log” 
• This becomes the “single source of truth”, upstream from the data warehouse 
• The same source of truth is used for real-time data processing as well as analytics, e.g. 
• Product recommendations 
• Ad targeting 
• Real-time website personalisation 
• Systems monitoring 
Snowplow will drive data-driven processes as well as off-line 
analytics
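The idea of one log feeding both real-time processes and off-line analytics can be sketched with an in-memory stand-in. This Python example is an assumption-laden toy: `UnifiedLog`, `poll` and the consumer names are invented here, and a real deployment would use a system such as Kinesis or Kafka rather than a Python list.

```python
class UnifiedLog:
    """Append-only, in-memory log; each consumer tracks its own read offset."""

    def __init__(self):
        self._events = []
        self._offsets = {}

    def append(self, event: dict) -> int:
        """Add an event to the end of the log and return its position."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer: str) -> list:
        """Return the events this consumer has not seen yet."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch


log = UnifiedLog()
log.append({"event": "page_view", "user": "u1"})
log.append({"event": "add_to_basket", "user": "u1"})

# A low-latency consumer reads often and only ever sees new events.
realtime = log.poll("recommendations")
log.append({"event": "order", "user": "u1"})
realtime_next = log.poll("recommendations")

# A slower consumer reads the same log later and gets the full history.
warehouse = log.poll("warehouse_loader")
```

Both consumers read the same events in the same order; they differ only in how often they read, which is what lets a single source of truth serve both kinds of use.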
Some background on unified log based architectures 
[Diagram: narrow data silos (search, e-comm, CRM, email marketing, ERP, CMS) spread across a cloud vendor / own data center and SaaS vendors #1 and #2, with some low-latency local loops, feed a unified log via streaming APIs / web hooks. The unified log provides wide data coverage; archiving to Hadoop provides full data history. The eventstream serves uses ranging from high latency to low latency: systems monitoring, product rec’s, ad hoc analytics, management reporting, fraud detection, churn prevention, APIs.]
We are part way through our Kinesis support, with additional 
components being released soon 
[Diagram: Snowplow Trackers -> Scala Stream Collector -> Raw event stream -> Enrich Kinesis app -> Enriched event stream, plus Bad raw events stream. S3 sink Kinesis app -> S3; Redshift sink Kinesis app -> Redshift.] 
• The parts in grey are still under development – we are working with Snowplow community members on these collaboratively 
• We are also starting work on support for Apache Kafka alongside Kinesis – for users who don’t want to run Snowplow on AWS
Questions? 
ulogprugcf (43% off Unified Log 
Processing eBook) 
http://snowplowanalytics.com 
https://github.com/snowplow/snowplow 
@snowplowdata 
I am in Berlin tomorrow – to meet up or chat, @alexcrdean on 
Twitter or alex@snowplowanalytics.com

Weitere ähnliche Inhalte

Was ist angesagt?

2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modelingyalisassoon
 
Snowplow, Metail and Cascalog
Snowplow, Metail and CascalogSnowplow, Metail and Cascalog
Snowplow, Metail and CascalogRobert Boland
 
How to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using SnowplowHow to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using SnowplowGiuseppe Gaviani
 
Snowplow is at the core of everything we do
Snowplow is at the core of everything we doSnowplow is at the core of everything we do
Snowplow is at the core of everything we doyalisassoon
 
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016yalisassoon
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...yalisassoon
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, HadoopSenthil Pandurangan
 
Snowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWSSnowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWSGiuseppe Gaviani
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessyalisassoon
 
Snowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comSnowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comyalisassoon
 
Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016yalisassoon
 
Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...yalisassoon
 
PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time AnalyticsAnil Madan
 
Data to Drive Decision-Making - CaliStream Meetup
Data to Drive Decision-Making - CaliStream MeetupData to Drive Decision-Making - CaliStream Meetup
Data to Drive Decision-Making - CaliStream MeetupJerome Boulon
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...Amazon Web Services
 
Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow Analytics
 

Was ist angesagt? (20)

2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling
 
Snowplow, Metail and Cascalog
Snowplow, Metail and CascalogSnowplow, Metail and Cascalog
Snowplow, Metail and Cascalog
 
How to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using SnowplowHow to evolve your analytics stack with your business using Snowplow
How to evolve your analytics stack with your business using Snowplow
 
Snowplow is at the core of everything we do
Snowplow is at the core of everything we doSnowplow is at the core of everything we do
Snowplow is at the core of everything we do
 
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
 
Snowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWSSnowplow: open source game analytics powered by AWS
Snowplow: open source game analytics powered by AWS
 
Snowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your businessSnowplow: evolve your analytics stack with your business
Snowplow: evolve your analytics stack with your business
 
Snowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comSnowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.com
 
Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016
 
Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...Implementing improved and consistent arbitrary event tracking company-wide us...
Implementing improved and consistent arbitrary event tracking company-wide us...
 
PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time Analytics
 
Data to Drive Decision-Making - CaliStream Meetup
Data to Drive Decision-Making - CaliStream MeetupData to Drive Decision-Making - CaliStream Meetup
Data to Drive Decision-Making - CaliStream Meetup
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache Spark
 
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3Snowplow presentation for Amsterdam Meetup #3
Snowplow presentation for Amsterdam Meetup #3
 

Ähnlich wie Big Data Beers - Introducing Snowplow

Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Demi Ben-Ari
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Codemotion
 
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsDamien Dallimore
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaAlexander Dean
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...Big Data Spain
 
[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_DataNAVER D2
 
A Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices GenerationA Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices GenerationBen Stopford
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2
 
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareData Con LA
 
WSO2 Workshop Sydney 2016 - Analytics
WSO2 Workshop Sydney 2016 -  AnalyticsWSO2 Workshop Sydney 2016 -  Analytics
WSO2 Workshop Sydney 2016 - AnalyticsDassana Wijesekara
 
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateIntroduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateSrinath Perera
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming VisualizationGuido Schmutz
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsANKIT GUPTA
 
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake EditionFrom BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake EditionRittman Analytics
 

Ähnlich wie Big Data Beers - Introducing Snowplow (20)

Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
 
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring Applications
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data
 
A Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices GenerationA Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices Generation
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
 
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
 
WSO2 Workshop Sydney 2016 - Analytics
WSO2 Workshop Sydney 2016 -  AnalyticsWSO2 Workshop Sydney 2016 -  Analytics
WSO2 Workshop Sydney 2016 - Analytics
 
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateIntroduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analytics
 
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake EditionFrom BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
From BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition
 

Mehr von Alexander Dean

Asynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAsynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAlexander Dean
 
What Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesWhat Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesAlexander Dean
 
Snowplow New York City Meetup #2
Snowplow New York City Meetup #2Snowplow New York City Meetup #2
Snowplow New York City Meetup #2Alexander Dean
 
Introducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricAlexander Dean
 
AWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAlexander Dean
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logAlexander Dean
 
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...Alexander Dean
 

Big Data Beers - Introducing Snowplow

  • 1. Introducing Snowplow Big Data Beers, Berlin Huge thanks to Zalando for hosting!
  • 2. Snowplow is an open-source web and event analytics platform, first version released in early 2012 • Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business in 2008 • After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy • We released Snowplow as a skunkworks prototype at start of 2012: github.com/snowplow/snowplow • We started working full time on Snowplow in summer 2013
  • 3. At Keplar, we grew frustrated by significant limitations in traditional web analytics programs Data collection Data processing Data access • Sample-based (e.g. Google Analytics) • Limited set of events e.g. page views, goals, transactions • Limited set of ways of describing events (custom dim 1, custom dim 2…) • Data is processed ‘once’ • No validation • No opportunity to reprocess e.g. following update to business rules • Data is aggregated prematurely • Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics) • Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst) • Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst) • As a result, data is siloed: hard to join with other data sets
  • 4. And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner CloudFront Amazon S3 Amazon EMR Amazon Redshift These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
  • 5. We wanted to take a fresh approach to web analytics • Your own web event data -> in your own data warehouse • Your own event data model • Slice / dice and mine the data in highly bespoke ways to answer your specific business questions • Plug in the broadest possible set of analysis tools to drive value from your data Data pipeline Data warehouse Analyse your data in any analysis tool
  • 6. Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems 1. Trackers 2. Collectors A B 3. Enrich C 4. Storage D 5. Analytics Generate event data from any environment Launched with: • JavaScript tracker Log raw events from trackers Launched with: • CloudFront collector Validate and enrich raw events Launched with: • HiveQL + Java UDF-based enrichment D = Standardised data protocols Store enriched events ready for analysis Launched with: • Amazon S3 Analyze enriched events Launched with: • HiveQL recipes These turned out to be critical to allowing us to evolve the above stack
  • 7. Our initial skunkworks version of Snowplow – it was basic but it worked, and we started getting traction Website / webapp Snowplow data pipeline v1 (spring 2012) CloudFront-based pixel collector HiveQL + Java UDF “ETL” Amazon S3 JavaScript event tracker
  • 8. What did people start using it for? Warehousing their web event data To enable… Agile aka ad hoc analytics Marketing attribution modelling Customer lifetime value calculations Customer churn detection RTB fraud Product recommendations
  • 9. Current Snowplow design and architecture
  • 10. Our protocol-first, loosely-coupled approach made it possible to start swapping out existing components… Website / webapp Snowplow data pipeline v2 (spring 2013) CloudFront-based event collector Scalding-based enrichment JavaScript event tracker HiveQL + Java UDF “ETL” Amazon S3 Amazon Redshift / PostgreSQL or Clojure-based event collector
  • 11. Our protocol-first, loosely-coupled approach made it possible to start swapping out existing components… Website / webapp Snowplow data pipeline v2 (spring 2013) CloudFront-based event collector Scalding-based enrichment JavaScript event tracker HiveQL + Java UDF “ETL” Amazon S3 Amazon Redshift / PostgreSQL or Clojure-based event collector • Allow Snowplow users to set a third-party cookie with a user ID • Important for ad networks, widget companies, multi-domain retailers • Because Snowplow users wanted a much faster query loop than HiveQL/MapReduce • We wanted a robust, feature-rich framework for managing validations, enrichments etc
  • 12. So far we have open-sourced a number of different trackers – with more planned Production ready: • JavaScript • No-JavaScript (image beacon) • Python • Lua • Arduino Beta: • Ruby • iOS • Android • Node.js In development: • .NET • PHP
  • 13. Enrichment process: what is Scalding? • Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop: Scalding Cascalog PyCascading cascading.jruby Cascading Hive Pig Java Hadoop MapReduce Hadoop DFS
  • 14. Our “enrichment process” (formerly known as ETL) actually does two things: validation and enrichment • Our validation model looks like this: Raw events “Bad” raw events + reasons why they are bad Enrichment Manager “Good” enriched events • Under the covers, we use a lot of monadic Scala (Scalaz) code
  • 15. Adding the enrichments that web analysts expect = very important to Snowplow uptake • Web analysts are used to a very specific set of enrichments from Google Analytics, SiteCatalyst etc • These enrichments have evolved over the past 15-20 years and are very domain specific: • Page querystring -> marketing campaign information (utm_ fields) • Referer data -> search engine name, country, keywords • IP address -> geographical location • Useragent -> browser, OS, computer information
  • 17. There are three big aspects to Snowplow’s 2014 roadmap 1. Make Snowplow work for non-web (e.g. mobile, IoT) environments as well as the web – RELEASED 2. Make Snowplow work with users’ JSON events as well as with our pre-defined events (aka page views, ecommerce transactions etc) – RELEASED 3. Move Snowplow away from an S3-based data pipeline to a unified log (Kinesis/Kafka)-based data pipeline – ONGOING 
  • 18. Snowplow is developing into an event analytics platform (not just a web analytics platform) Data warehouse Collect event data from any connected device
  • 19. Web analysts work with a small number of event types – outside of web, the number of possible event types is… infinite Web events • Page view • Page activity • Order • Add to basket All events • Game saved • Car started • Machine broke • Spellcheck run • Fridge empty • Screenshot taken • App crashed • SMS sent • Disk full • Screen viewed • Player died • Tweet drafted • Till opened • Product returned ∞ • Taxi arrived • Cluster started • Phonecall ended
  • 20. As we got further away from the web, we needed to start supporting users’ own JSON events • Specifically, events represented as JSONs with arbitrary name: value pairs (arbitrary to Snowplow, not to the company using Snowplow!)
  • 21. Supporting a fixed set of web events and JSON events is a difficult problem • Almost everybody in event analytics falls on one or other side of this divide: Fixed set of web events (page views etc) + custom variables Send anything JSONs
  • 22. We wanted to bridge that divide, making it so that Snowplow comes with structured events “out of the box”, but is extensible with unstructured events Fixed set of web events (page views etc) + custom variables Send anything JSONs
  • 23. Issues with the event name: • Separate from the event properties • Not versioned • Not unique – HBO video played versus Brightcove video played Lots of unanswered questions about the properties: • Is length required, and is it always a number? • Is id required, and is it always a string? • What other optional properties are allowed for a video play? Other issues: • What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields • Why does the analyst need to keep an implicit schema in their head to analyze video played events?
  • 24. MixPanel et al cause “schema loss”
  • 25. We decided to use JSON Schema, with additional metadata about what the schema represents
  • 26. From a tracker, you send in a JSON which is self-describing, with a schema header and data body
  • 27. iglu:com.channel2.vod/video_played/jsonschema/1-0-0 Schema format Event name The vendor of this event We are calling our schema repository technology Iglu Schema version Anatomy of an Iglu schema URI
  • 28. To add this to Snowplow, we developed a new schema repository called Iglu, and a shredding step in Hadoop
  • 29. JSON Schema just gives us a data structure for events – we are also evolving a grammar to capture the semantics of events Subject Direct Object Indirect Object Verb Event Context Prep. ~ Object
  • 30. In parallel, we plan to evolve Snowplow from an event analytics platform into a “digital nervous system” for data driven companies • The event data fed into Snowplow is written into a “Unified Log” • This becomes the “single source of truth”, upstream from the data warehouse • The same source of truth is used for real-time data processing as for analytics e.g. • Product recommendations • Ad targeting • Real-time website personalisation • Systems monitoring Snowplow will drive data-driven processes as well as off-line analytics
  • 31. Some background on unified log based architectures CLOUD VENDOR / OWN DATA CENTER Search Silo SOME LOW LATENCY LOCAL LOOPS E-comm Silo CRM SAAS VENDOR #2 Email marketing ERP Silo CMS Silo SAAS VENDOR #1 NARROW DATA SILOES Streaming APIs / web hooks Unified log Archiving Hadoop < WIDE DATA COVERAGE > < FULL DATA HISTORY > Systems monitoring Eventstream HIGH LATENCY LOW LATENCY Product rec’s Ad hoc analytics Management reporting Fraud detection Churn prevention APIs
  • 32. We are part way through our Kinesis support, with additional components being released soon Scala Stream Collector Raw event stream Enrich Kinesis app Bad raw events stream Enriched event stream S3 Redshift S3 sink Kinesis app Redshift sink Kinesis app Snowplow Trackers • The parts in grey are still under development – we are working with Snowplow community members on these collaboratively • We are also starting work on support for Apache Kafka alongside Kinesis – for users who don’t want to run Snowplow on AWS
  • 33. Questions? ulogprugcf (43% off Unified Log Processing eBook) http://snowplowanalytics.com https://github.com/snowplow/snowplow @snowplowdata I am in Berlin tomorrow – to meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com
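
A minimal sketch of two ideas from the slides: the Iglu schema URI anatomy (slide 27) and the validation model that splits raw events into "good" events and "bad" events plus reasons (slide 14). It is illustrative only and is not Snowplow's or Iglu's actual code. The `schema` and `data` key names, the `REGISTRY` dictionary, the function names and the simple required-property check are all assumptions made for this example; the check is a stand-in for real JSON Schema validation. The `length` and `id` properties come from the video played example on slide 23.

```python
import re
from typing import Dict, List, NamedTuple, Tuple

# Anatomy from slide 27: iglu:<vendor>/<event name>/<schema format>/<schema version>
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[A-Za-z0-9_.-]+)/(?P<name>[A-Za-z0-9_-]+)/"
    r"(?P<format>[A-Za-z0-9_-]+)/(?P<version>\d+-\d+-\d+)$"
)


class SchemaKey(NamedTuple):
    vendor: str
    name: str
    format: str
    version: str


def parse_iglu_uri(uri: str) -> SchemaKey:
    """Split an Iglu schema URI into its four parts, or raise ValueError."""
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid Iglu schema URI: {uri!r}")
    return SchemaKey(**match.groupdict())


# Hypothetical in-memory stand-in for a schema repository:
# schema URI -> {required property: accepted Python type(s)}
REGISTRY: Dict[str, Dict[str, tuple]] = {
    "iglu:com.channel2.vod/video_played/jsonschema/1-0-0": {
        "length": (int, float),
        "id": (str,),
    },
}


def validate(event: dict) -> List[str]:
    """Return the reasons an event is bad; an empty list means it is good."""
    schema_uri = event.get("schema")
    data = event.get("data")
    if not isinstance(schema_uri, str) or not isinstance(data, dict):
        return ["event needs a 'schema' string and a 'data' object"]
    try:
        parse_iglu_uri(schema_uri)
    except ValueError as err:
        return [str(err)]
    required = REGISTRY.get(schema_uri)
    if required is None:
        return [f"schema not found: {schema_uri}"]
    reasons = []
    for prop, expected in required.items():
        if prop not in data:
            reasons.append(f"missing required property: {prop}")
        elif isinstance(data[prop], bool) or not isinstance(data[prop], expected):
            # bool is excluded explicitly because it is a subclass of int
            reasons.append(f"wrong type for property: {prop}")
    return reasons


def split_events(events: List[dict]) -> Tuple[List[dict], List[Tuple[dict, List[str]]]]:
    """Split raw events into good events and (bad event, reasons) pairs."""
    good, bad = [], []
    for event in events:
        reasons = validate(event)
        if reasons:
            bad.append((event, reasons))
        else:
            good.append(event)
    return good, bad
```

With a check like this, the developer mistake described on slide 23 (sending "len" instead of "length") lands in the bad events with an explicit reason, rather than silently splitting the data across two fields.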