Building data processing apps in Scala: the Snowplow experience
London Scala Users’ Group

Building data processing apps in Scala

1. Snowplow – what is it?
2. Snowplow and Scala
3. Deep dive into our Scala code
4. Modularization and non-Snowplow code you can use
5. Roadmap
6. Questions
7. Appendix: even more roadmap
Snowplow – what is it?

Today, Snowplow is primarily an open source web analytics platform

Snowplow: data pipeline – Website / webapp → Collect → Amazon S3 → Transform and enrich → Amazon Redshift / PostgreSQL

• Your granular, event-level and customer-level data, in your own data warehouse
• Connect any analytics tool to your data
• Join your web analytics data with any other data set
Snowplow was born out of our frustration with traditional web analytics tools…

• Limited set of reports that don’t answer business questions
  • Traffic levels by source
  • Conversion levels
  • Bounce rates
  • Pages / visit

• Web analytics tools don’t understand the entities that matter to the business
  • Customers, intentions, behaviours, articles, videos, authors, subjects, services…
  • …vs pages, conversions, goals, clicks, transactions

• Web analytics tools are siloed
  • Hard to integrate with other data sets incl. digital (marketing spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)
…and out of the opportunities to tame new “big data” technologies

These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable:

1. Trackers – generate event data. Examples: JavaScript tracker; Python / Lua / No-JS / Arduino trackers
2. Collectors – receive data from trackers and log it to S3. Examples: Cloudfront collector; Clojure collector for Amazon EB
3. Enrich – clean and enrich raw data. Built on Scalding / Cascading / Hadoop and powered by Amazon EMR
4. Storage – store data ready for analysis. Examples: Amazon Redshift; PostgreSQL; Amazon S3
5. Analytics

The interfaces between the subsystems (labelled A–D) are standardised data protocols. The pipeline is batch-based – normally run overnight; sometimes every 4-6 hours.
Snowplow and Scala

Our initial skunkworks version of Snowplow had no Scala 

Snowplow data pipeline v1: Website / webapp → JavaScript event tracker → CloudFront-based pixel collector → HiveQL + Java UDF “ETL” → Amazon S3
But our schema-first, loosely coupled approach made it possible to start swapping out existing components…

Snowplow data pipeline v2: Website / webapp → JavaScript event tracker → CloudFront-based or Clojure-based event collector → Amazon S3 → Scalding-based enrichment (or the legacy HiveQL + Java UDF “ETL”) → Amazon Redshift / PostgreSQL
What is Scalding?

• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop. In the Hadoop stack, Cascading sits – alongside Hive and Pig – on top of Hadoop MapReduce and the Hadoop DFS, while Scalding, Cascalog, cascading.jruby and PyCascading are language-specific layers on top of Cascading
We chose Cascading because we liked its “plumbing” abstraction over vanilla MapReduce
Why did we choose Scalding instead of one of the other
Cascading DSLs/APIs?
• Lots of internal experience with Scala – could hit the
ground running (only very basic awareness of Clojure
when we started the project)
• Scalding created and supported by Twitter, who use it
throughout their organization – so we knew it was a
safe long-term bet
• More controversial opinion (although maybe not at a
Scala UG): we believe that data pipelines should be as
strongly typed as possible – all the other DSLs/APIs on
top of Cascading encourage dynamic typing
Strongly typed data pipelines – why?

• Catch errors as soon as possible – and report them in a strongly typed way too
• Define the inputs and outputs of each of your data processing steps in an unambiguous way
• Forces you to formally address the data types flowing through your system
• Lets you write code like this:
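(The code from the slide isn’t reproduced here. As a stand-in, the following is a minimal, hypothetical sketch in plain Scala – the case classes, field names and the `enrich` step are illustrative, not Snowplow’s actual types – of what a strongly typed pipeline step buys you:)

```scala
// Hypothetical sketch of a strongly typed pipeline step.
// The input and output types make the step's contract unambiguous:
final case class RawEvent(timestamp: String, userId: String, pageUri: String)
final case class EnrichedEvent(timestamp: String, userId: String,
                               pageUri: String, pageHost: Option[String])

// Errors are reported as typed values rather than thrown
def enrich(raw: RawEvent): Either[String, EnrichedEvent] =
  if (raw.timestamp.isEmpty)
    Left(s"Event for user ${raw.userId} has no timestamp")
  else {
    // An example enrichment: derive the page's host from its URI
    val host = scala.util.Try(new java.net.URI(raw.pageUri)).toOption
      .flatMap(u => Option(u.getHost))
    Right(EnrichedEvent(raw.timestamp, raw.userId, raw.pageUri, host))
  }
```

The compiler now guarantees that whatever consumes this step handles both the success and the failure case.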
Deep dive into our Scala
code
The secret sauce for data processing in Scala: the Scalaz Validation (1/3)

• Our basic processing model for Snowplow looks like this: raw events feed the Scalding enrichment process, which emits “good” enriched events on one side, and “bad” raw events – plus the reasons why they are bad – on the other
• This fits incredibly well onto the Validation applicative functor from the Scalaz project
The secret sauce for data processing in Scala: the Scalaz
Validation (2/3)
• We were able to express our data flow in terms of some relatively simple types:
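(The type definitions shown on the slide aren’t reproduced here. As a hedged sketch of the idea using only the standard library – `Either[List[String], A]` standing in for Scalaz’s `Validation[NonEmptyList[String], A]`, and all type and field names illustrative:)

```scala
// Stand-in for Scalaz Validation: a value either fails with a
// (conceptually non-empty) list of error messages, or succeeds with an A
type ValidatedNel[A] = Either[List[String], A]

// Illustrative event types
final case class CanonicalInput(collector: String, payload: Map[String, String])
final case class CanonicalOutput(appId: String, eventName: String)

// The whole data flow in two aliases: a raw line either fails validation,
// or yields Some(input) - or None for lines we deliberately skip...
type ValidatedMaybeCanonicalInput = ValidatedNel[Option[CanonicalInput]]
// ...and enrichment either fails or yields a finished output event
type ValidatedCanonicalOutput = ValidatedNel[CanonicalOutput]

val skipped: ValidatedMaybeCanonicalInput = Right(None)
val bad: ValidatedCanonicalOutput = Left(List("Unparseable querystring"))
```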
The secret sauce for data processing in Scala: the Scalaz
Validation (3/3)
• Scalaz Validation lets us do a variety of different validations and enrichments,
and then collate the failures
• This is really powerful!
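(As a hedged illustration of “collate the failures” – a plain-Scala stand-in for Scalaz’s applicative combination of Validations, so it runs without the dependency; the field checks are invented for the example:)

```scala
// Run several independent checks and accumulate every error,
// instead of stopping at the first one.
def validateField(name: String, value: String)(ok: String => Boolean): Either[List[String], String] =
  if (ok(value)) Right(value) else Left(List(s"Field '$name' is invalid: '$value'"))

// Combine two validated values, keeping all errors from both sides
def combine[A, B](a: Either[List[String], A],
                  b: Either[List[String], B]): Either[List[String], (A, B)] =
  (a, b) match {
    case (Right(x), Right(y)) => Right((x, y))
    case (Left(e1), Left(e2)) => Left(e1 ++ e2)  // collate the failures
    case (Left(e), _)         => Left(e)
    case (_, Left(e))         => Left(e)
  }

val tstamp  = validateField("tstamp", "")(_.nonEmpty)
val eventId = validateField("event_id", "abc")(_.length == 36)
// Both failures are reported, not just the first:
val result = combine(tstamp, eventId)
```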
On the testing side: we love Specs2 data tables…
• They let us test a variety of inputs and expected outputs without making the
mistake of just duplicating the data processing functionality in the test:
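(The Specs2 snippet from the slide isn’t reproduced. Here is the same idea – a table of inputs and expected outputs driving a single check – written in plain Scala so it runs without Specs2; the function under test is invented for the example:)

```scala
// Table-driven test in the spirit of a Specs2 DataTable:
// each row pairs an input with its expected output, and one check
// runs over every row - no duplication of the production logic.
def extractHost(uri: String): String =
  new java.net.URI(uri).getHost

val table = List(
  // uri                                  -> expected host
  "http://snowplowanalytics.com/blog"     -> "snowplowanalytics.com",
  "https://github.com/snowplow/snowplow"  -> "github.com"
)

val failures = table.collect {
  case (uri, expected) if extractHost(uri) != expected => uri
}
```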
… and are starting to do more with ScalaCheck

• ScalaCheck is a property-based testing framework, originally inspired by Haskell’s QuickCheck
• We use it in a few places – including to generate unpredictable bad data, and also to validate our new Thrift schema for raw Snowplow events:
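(A hand-rolled stand-in for the idea, using only the standard library so it runs without ScalaCheck: generate many unpredictable inputs and assert one property over all of them. The round-trip property is an invented example, not Snowplow’s actual suite:)

```scala
// Property: URL-encoding then decoding any string gives the string back.
def roundTrips(s: String): Boolean = {
  val encoded = java.net.URLEncoder.encode(s, "UTF-8")
  java.net.URLDecoder.decode(encoded, "UTF-8") == s
}

// Generate 100 unpredictable (but well-formed) strings from a fixed seed,
// plus a few deliberately nasty edge cases
val rng = new scala.util.Random(42)
val random = List.fill(100)(rng.alphanumeric.take(rng.nextInt(20)).mkString)
val edgeCases = List("", "a b&c=d", "100%", "?q=snow plow&p=1")
val counterexamples = (random ++ edgeCases).filterNot(roundTrips)
```

A real ScalaCheck property adds shrinking and far richer generators, but the shape – property over generated data, not hand-picked cases – is the same.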
Build and deployment: we have learnt to love (or at least peacefully co-exist with) SBT

• .scala based SBT build, not .sbt
• We use sbt assembly to create a fat jar for our Scalding ETL process – with some custom exclusions to play nicely on Amazon Elastic MapReduce
• Deployment is incredibly easy compared to the pain we have had with our two Ruby instrumentation apps (EmrEtlRunner and StorageLoader)
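(A hypothetical excerpt showing what such a setup can look like – sbt-assembly plugin syntax of the sbt 0.12/0.13 era in a `project/Build.scala`; the project name and the exclusion rule are illustrative, not Snowplow’s actual build:)

```scala
// Hypothetical project/Build.scala excerpt (a .scala build, not .sbt)
import sbt._
import Keys._
import sbtassembly.Plugin._
import AssemblyKeys._

object SnowplowEtlBuild extends Build {
  lazy val project = Project("snowplow-hadoop-etl", file("."))
    .settings(assemblySettings: _*)
    .settings(
      // Custom exclusion: Elastic MapReduce already provides Hadoop's jars
      // on the cluster classpath, so keep them out of the fat jar
      excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
        cp filter { _.data.getName.startsWith("hadoop-") }
      }
    )
}
```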
Modularization and non-Snowplow code you can use

We try to make our validation and enrichment process as modular as possible

• This encourages testability and re-use – it also widens the number of contributors vs this functionality being embedded in Snowplow
• The Enrichment Manager uses external libraries (hosted in a Snowplow repository) which can be used in non-Snowplow projects – some of them not yet integrated into the Enrichment Manager:
We also have a few standalone Scala projects which might be of
interest
• None of these projects assume that you are running Snowplow:
Snowplow roadmap
We want to move Snowplow to a unified log-based architecture
[Diagram: today’s narrow data siloes – CMS, e-comm, ERP and CRM siloes in your own cloud or data center, plus SaaS vendors reached via APIs, streaming APIs / web hooks and event streams, with some low-latency local loops such as search – all feed into a unified log. The log is archived to Hadoop for high-latency processing with wide data coverage and full data history (ad hoc analytics, management reporting, email marketing), while low-latency applications – product rec’s, systems monitoring, fraud detection, churn prevention – read directly from the log.]
Again, our schema-first approach is letting us get to this architecture through a set of baby steps (1/2)

• In 0.8.12 at the start of the year we performed some surgery to de-couple our core enrichment code from its Scalding harness: pre-0.8.12 everything lived in hadoop-etl; from 0.8.12 the record-level enrichment functionality sits in scala-common-enrich, shared by scala-hadoop-enrich and scala-kinesis-enrich
Then in 0.9.0 we released our first new Scala components leveraging Amazon Kinesis:

Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app → enriched event stream → Redshift sink Kinesis app → Redshift. An S3 sink Kinesis app archives the raw event stream to S3, and bad raw events go to their own stream.

• The parts in grey are still under development – we are working with Snowplow community members on these collaboratively
Questions?

http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To have a coffee or beer and talk Scala/data – @alexcrdean or
alex@snowplowanalytics.com
Appendix: even more
roadmap!
Separately, we want to re-architect our data processing pipeline
to make it even more schema’ed! (1/3)
• Our current approach involves a “Tracker Protocol” which is defined in a wiki
page, processed in the Enrichment Manager and then written out to TSV files for
loading into Redshift and Postgres (see over)
Separately, we want to re-architect our data processing pipeline
to make it even more schema’ed! (3/3)
• We are planning to replace the existing flow with a JSON Schema-driven approach, in which the JSON Schema defining events:

1. Defines the structure of the raw events, which arrive in JSON format
2. Validates those events in the Enrichment Manager
3. Defines the structure of the enriched events, in Thrift or Avro format
4. Drives the shredding performed by the Shredder
5. Defines the structure of the enriched events in TSV, ready for loading into the db
More Related Content

Viewers also liked

Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsMICHRAFY MUSTAFA
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaAlexander Dean
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFramesJen Aman
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Snowplow at Sigfig
Snowplow at SigfigSnowplow at Sigfig
Snowplow at Sigfigyalisassoon
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive InspectionLinda Tillman
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Martin Odersky
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for TrainingBryan Yang
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
The Evolution of Scala
The Evolution of ScalaThe Evolution of Scala
The Evolution of ScalaMartin Odersky
 
Scala: functional programming for the imperative mind
Scala: functional programming for the imperative mindScala: functional programming for the imperative mind
Scala: functional programming for the imperative mindSander Mak (@Sander_Mak)
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 

Viewers also liked (19)

Indexed Hive
Indexed HiveIndexed Hive
Indexed Hive
 
Fun[ctional] spark with scala
Fun[ctional] spark with scalaFun[ctional] spark with scala
Fun[ctional] spark with scala
 
Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and Implementations
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Snowplow at Sigfig
Snowplow at SigfigSnowplow at Sigfig
Snowplow at Sigfig
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive Inspection
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Functional Programming in Scala
Functional Programming in ScalaFunctional Programming in Scala
Functional Programming in Scala
 
The Evolution of Scala
The Evolution of ScalaThe Evolution of Scala
The Evolution of Scala
 
Scala: functional programming for the imperative mind
Scala: functional programming for the imperative mindScala: functional programming for the imperative mind
Scala: functional programming for the imperative mind
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 

More from Alexander Dean

Asynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAsynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAlexander Dean
 
What Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesWhat Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesAlexander Dean
 
Snowplow New York City Meetup #2
Snowplow New York City Meetup #2Snowplow New York City Meetup #2
Snowplow New York City Meetup #2Alexander Dean
 
Introducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricAlexander Dean
 
Unified Log London (May 2015) - Why your company needs a unified log
Unified Log London (May 2015) - Why your company needs a unified logUnified Log London (May 2015) - Why your company needs a unified log
Unified Log London (May 2015) - Why your company needs a unified logAlexander Dean
 
AWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAlexander Dean
 
Snowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back againSnowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back againAlexander Dean
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logAlexander Dean
 
Big Data Beers - Introducing Snowplow
Big Data Beers - Introducing SnowplowBig Data Beers - Introducing Snowplow
Big Data Beers - Introducing SnowplowAlexander Dean
 
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...Alexander Dean
 
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...Alexander Dean
 

More from Alexander Dean (11)

Asynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAsynchronous micro-services and the unified log
Asynchronous micro-services and the unified log
 
What Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registriesWhat Crimean War gunboats teach us about the need for schema registries
What Crimean War gunboats teach us about the need for schema registries
 
Snowplow New York City Meetup #2
Snowplow New York City Meetup #2Snowplow New York City Meetup #2
Snowplow New York City Meetup #2
 
Introducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabric
 
Unified Log London (May 2015) - Why your company needs a unified log
Unified Log London (May 2015) - Why your company needs a unified logUnified Log London (May 2015) - Why your company needs a unified log
Unified Log London (May 2015) - Why your company needs a unified log
 
AWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified log
 
Snowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back againSnowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back again
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified log
 
Big Data Beers - Introducing Snowplow
Big Data Beers - Introducing SnowplowBig Data Beers - Introducing Snowplow
Big Data Beers - Introducing Snowplow
 
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
From Zero to Hadoop: a tutorial for getting started writing Hadoop jobs on Am...
 
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
 

Recently uploaded

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

LSUG talk - Building data processing apps in Scala, the Snowplow experience

  • 1. Building data processing apps in Scala: the Snowplow experience London Scala Users’ Group
  • 2. Building data processing apps in Scala 1. Snowplow – what is it? 2. Snowplow and Scala 3. Deep dive into our Scala code 4. Modularization and non-Snowplow code you can use 5. Roadmap 6. Questions 7. Appendix: even more roadmap
  • 4. Today, Snowplow is primarily an open source web analytics platform Snowplow: data pipeline Website / webapp Amazon S3 Collect Transform and enrich Amazon Redshift / PostgreSQL • Your granular, event-level and customer-level data, in your own data warehouse • Connect any analytics tool to your data • Join your web analytics data with any other data set
  • 5. Snowplow was born out of our frustration with traditional web analytics tools… • Limited set of reports that don’t answer business questions • • • • Traffic levels by source Conversion levels Bounce rates Pages / visit • Web analytics tools don’t understand the entities that matter to business • Customers, intentions, behaviours, articles, videos, authors, subjects, services… • …vs pages, conversions, goals, clicks, transactions • Web analytics tools are siloed • Hard to integrate with other data sets incl. digital (marketing spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)
  • 6. …and out of the opportunities to tame new “big data” technologies These tools make it possible to capture, transform, store and analyse all your granular, event-level data, to you can perform any analysis
  • 7. Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable 1. Trackers A 2. Collectors B 3. Enrich C 4. Storage D 5. Analytics Generate event data Receive data from trackers and log it to S3 Clean and enrich raw data Store data ready for analysis Examples: • JavaScript tracker • Python / Lua / No-JS / Arduino tracker Examples: • Cloudfront collector • Clojure collector for Amazon EB Built on Scalding / Cascading / Hadoop and powered by Amazon EMR Examples: • Amazon Redshift • PostgreSQL • Amazon S3 • Batch-based A D Standardised data protocols • Normally run overnight; sometimes every 4-6 hours
  • 9. Our initial skunkworks version of Snowplow had no Scala  Snowplow data pipeline v1 Website / webapp JavaScript event tracker CloudFrontbased pixel collector HiveQL + Java UDF “ETL” Amazon S3
  • 10. But our schema-first, loosely coupled approach made it possible to start swapping out existing components… Snowplow data pipeline v2 Website / webapp Amazon S3 CloudFrontbased event collector JavaScript event tracker or Scaldingbased enrichment Clojurebased event collector HiveQL + Java UDF “ETL” Amazon Redshift / PostgreSQL
  • 11. What is Scalding? • Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop: Scalding Cascalog cascading. jruby PyCascading Cascading Hive Java Hadoop MapReduce Hadoop DFS Pig
  • 12. We chose Cascading because we liked their “plumbing” abstraction over vanilla MapReduce
  • 13. Why did we choose Scalding instead of one of the other Cascading DSLs/APIs? • Lots of internal experience with Scala – could hit the ground running (only very basic awareness of Clojure when we started the project) • Scalding created and supported by Twitter, who use it throughout their organization – so we knew it was a safe long-term bet • More controversial opinion (although maybe not at a Scala UG): we believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing
• 14. Strongly typed data pipelines – why? • Catch errors as soon as possible – and report them in a strongly typed way too • Define the inputs and outputs of each of your data processing steps in an unambiguous way • Forces you to formally address the data types flowing through your system • Lets you write code like this:
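The code on the slide is not reproduced here, but a minimal sketch of the idea might look like this (the class and field names are illustrative, not Snowplow's actual types). The step's input, output and failure mode are all visible in its signature:

```scala
// Hypothetical event types for a strongly typed pipeline step
case class RawEvent(collector: String, payload: String)
case class EnrichedEvent(collector: String, page: String, userId: String)

// A processing step is a function from one well-defined type to
// another, with failure captured in the type rather than at runtime
def enrich(raw: RawEvent): Either[String, EnrichedEvent] =
  raw.payload.split("&").toList match {
    case List(page, userId) => Right(EnrichedEvent(raw.collector, page, userId))
    case _                  => Left(s"Unparseable payload: ${raw.payload}")
  }
```

The compiler now rejects any pipeline that wires an `EnrichedEvent` consumer to a `RawEvent` producer, which is exactly the class of bug dynamic-typed DSLs only surface at runtime.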
  • 15. Deep dive into our Scala code
  • 16. The secret sauce for data processing in Scala: the Scalaz Validation (1/3) • Our basic processing model for Snowplow looks like this: Raw events Scalding enrichment process “Bad” raw events + reasons why they are bad “Good” enriched events • This fits incredibly well onto the Validation applicative functor from the Scalaz project
  • 17. The secret sauce for data processing in Scala: the Scalaz Validation (2/3) • We were able to express our data flow in terms of some relatively simple types:
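The slide's types are not shown here, but their flavour can be sketched with stdlib `Either` standing in for Scalaz's `ValidationNel` (all names below are illustrative, not Snowplow's actual aliases):

```scala
// Hypothetical canonical event types
case class CanonicalInput(timestamp: Long, payload: Map[String, String])
case class CanonicalOutput(pageUrl: String, userId: String)

// Either a non-empty list of failure messages, or a successfully
// parsed event; the Option models a valid-but-ignorable hit
type ValidatedMaybeCanonicalInput = Either[List[String], Option[CanonicalInput]]
type ValidatedCanonicalOutput     = Either[List[String], CanonicalOutput]
```

Expressing the whole flow in aliases like these means every stage of the Scalding job states, in its signature, whether it can fail and what a failure carries.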
  • 18. The secret sauce for data processing in Scala: the Scalaz Validation (3/3) • Scalaz Validation lets us do a variety of different validations and enrichments, and then collate the failures • This is really powerful!
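Scalaz's `Validation` applicative gives this failure-collating behaviour for free; a minimal hand-rolled sketch of the same idea using stdlib `Either` (names are illustrative):

```scala
type Validated[A] = Either[List[String], A]

def validateField(name: String, value: String): Validated[String] =
  if (value.nonEmpty) Right(value) else Left(List(s"Field $name is empty"))

// Combine two validations, collating failures instead of
// short-circuiting on the first one (fail-slow, not fail-fast)
def map2[A, B, C](va: Validated[A], vb: Validated[B])(f: (A, B) => C): Validated[C] =
  (va, vb) match {
    case (Right(a), Right(b)) => Right(f(a, b))
    case (Left(ea), Left(eb)) => Left(ea ++ eb)
    case (Left(ea), _)        => Left(ea)
    case (_, Left(eb))        => Left(eb)
  }
```

Running `map2(validateField("page", ""), validateField("user", ""))((p, u) => (p, u))` reports both failures together, which is what makes the "bad raw events + reasons why they are bad" output stream possible.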
  • 19. On the testing side: we love Specs2 data tables… • They let us test a variety of inputs and expected outputs without making the mistake of just duplicating the data processing functionality in the test:
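Specs2 expresses this with its DataTable DSL; the same idea in plain, dependency-free Scala is a table of inputs and expected outputs driving a single check (the enrichment below is a toy stand-in, not Snowplow's real user-agent parsing):

```scala
// Toy enrichment under test
def browserFamily(useragent: String): String =
  if (useragent.contains("Chrome")) "Chrome"
  else if (useragent.contains("Safari")) "Safari"
  else "Unknown"

// Each row pairs an input with its expected output – illustrative values
val examples = List(
  ("Mozilla/5.0 Chrome/33.0.1750.152", "Chrome"),
  ("Mozilla/5.0 Version/7.0 Safari/537.75", "Safari"),
  ("curl/7.30.0", "Unknown")
)

for ((ua, expected) <- examples)
  assert(browserFamily(ua) == expected, s"$ua should map to $expected")
```

The point of the table style is that the expected values are written down independently, rather than computed by re-running the production logic inside the test.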
  • 20. … and are starting to do more with ScalaCheck • ScalaCheck is a property-based testing framework, originally inspired by Haskell’s QuickCheck • We use it in a few places – including to generate unpredictable bad data and also to validate our new Thrift schema for raw Snowplow events:
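ScalaCheck's `Prop.forAll` generates randomized inputs for a stated property; a hand-rolled miniature of the same idea using only `scala.util.Random` (the round-trip property itself is just an illustrative example, not one of Snowplow's actual properties):

```scala
import scala.util.Random

def encode(s: String): String = java.net.URLEncoder.encode(s, "UTF-8")
def decode(s: String): String = java.net.URLDecoder.decode(s, "UTF-8")

// Property: decoding an encoded string always returns the original,
// checked over 100 randomly generated strings
val rng = new Random(42)
val holds = (1 to 100).forall { _ =>
  val s = rng.alphanumeric.take(rng.nextInt(30)).mkString
  decode(encode(s)) == s
}
```

A real ScalaCheck property adds shrinking (minimising a failing input automatically), which is what makes it so useful for hunting down unpredictable bad data.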
• 21. Build and deployment: we have learnt to love (or at least peacefully co-exist with) SBT • .scala based SBT build, not .sbt • We use sbt assembly to create a fat jar for our Scalding ETL process – with some custom exclusions to play nicely on Amazon Elastic MapReduce • Deployment is incredibly easy compared to the pain we have had with our two Ruby instrumentation apps (EmrEtlRunner and StorageLoader)
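A hypothetical sketch of such a `.scala`-based build, in the style of the era's sbt-assembly 0.x plugin (the project name and the excluded jar are illustrative examples, not Snowplow's actual configuration):

```scala
import sbt._
import Keys._
import sbtassembly.Plugin._
import AssemblyKeys._

object SnowplowBuild extends Build {
  lazy val project = Project("snowplow-etl", file("."))
    .settings(assemblySettings: _*)
    .settings(
      jarName in assembly := "snowplow-etl-fat.jar",
      // Exclude jars already provided on Elastic MapReduce's classpath,
      // so the fat jar does not clash with the cluster's own Hadoop
      excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
        cp filter { jar => jar.data.getName.startsWith("hadoop-core") }
      }
    )
}
```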
• 22. Modularization and non-Snowplow code you can use
• 23. We try to make our validation and enrichment process as modular as possible • This encourages testability and re-use – also it widens the number of contributors vs this functionality being embedded in Snowplow • The Enrichment Manager uses external libraries (hosted in a Snowplow repository) which can be used in non-Snowplow projects [diagram: the Enrichment Manager and its external libraries, some not yet integrated]
  • 24. We also have a few standalone Scala projects which might be of interest • None of these projects assume that you are running Snowplow:
• 26. We want to move Snowplow to a unified log-based architecture [diagram: narrow data siloes in the cloud vendor / own data center and SaaS vendors (CMS silo, e-comm silo, ERP silo, CRM silo, search, eventstream) feed via APIs, streaming APIs and web hooks into a unified log; the log is archived to Hadoop for high-latency but wide-coverage, full-history uses (ad hoc analytics, management reporting, email marketing), while low-latency local loops drive product recommendations, systems monitoring, fraud detection and churn prevention]
• 27. Again, our schema-first approach is letting us get to this architecture through a set of baby steps (1/2) • In 0.8.12 at the start of the year we performed some surgery to de-couple our core enrichment code from its Scalding harness [diagram: pre-0.8.12's monolithic hadoop-etl became scala-hadoop-enrich and scala-kinesis-enrich in 0.8.12, with the record-level enrichment functionality extracted into scala-common-enrich]
• 28. Then in 0.9.0 we released our first new Scala components leveraging Amazon Kinesis • The parts in grey are still under development – we are working with Snowplow community members on these collaboratively [diagram: Snowplow Trackers → Scala Stream Collector → raw event stream → Scala Kinesis Enrich app → enriched event stream → Redshift sink Kinesis app → Redshift; an S3 sink Kinesis app archives the raw event stream to S3, and bad raw events go to their own stream]
  • 29. Questions? http://snowplowanalytics.com https://github.com/snowplow/snowplow @snowplowdata To have a coffee or beer and talk Scala/data – @alexcrdean or alex@snowplowanalytics.com
  • 31. Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (1/3) • Our current approach involves a “Tracker Protocol” which is defined in a wiki page, processed in the Enrichment Manager and then written out to TSV files for loading into Redshift and Postgres (see over)
• 32. [diagram of the current Tracker Protocol-driven flow described on the previous slide]
• 33. Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (3/3) • We are planning to replace the existing flow with a JSON Schema-driven approach [diagram: a JSON Schema defining events (1) defines the structure of raw events in JSON format; (2) the Enrichment Manager validates events against it; (3) it defines the structure of enriched events in Thrift or Avro format; (4) it drives the Shredder; (5) it defines the structure of enriched events in TSV, ready for loading into the db]
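A hypothetical fragment of what such an event-defining JSON Schema might look like (the field names are illustrative, not Snowplow's actual event schema):

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Illustrative schema for a raw Snowplow-style event",
  "type": "object",
  "properties": {
    "event_name":       { "type": "string" },
    "collector_tstamp": { "type": "string", "format": "date-time" },
    "user_id":          { "type": "string" }
  },
  "required": ["event_name", "collector_tstamp"]
}
```

One schema document can then serve all five roles above: it documents the structure, drives validation in the Enrichment Manager, and mechanically determines the Thrift/Avro and TSV layouts downstream.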