I talked to the London Scala Users' Group about building Snowplow, an open source event analytics platform, on top of Scala and key libraries and frameworks including Scalding, Scalaz and Spray. The talk highlights some of the data processing tricks and techniques we picked up along the way, particularly: schema-first development; monadic ETL; datatable-based testing; data transformation maps. It also introduces some of the Scala libraries the Snowplow team has open sourced along the way (such as scala-forex, referer-parser and scala-maxmind-geoip).
2. Building data processing apps in Scala
1. Snowplow – what is it?
2. Snowplow and Scala
3. Deep dive into our Scala code
4. Modularization and non-Snowplow code you can use
5. Roadmap
6. Questions
7. Appendix: even more roadmap
4. Today, Snowplow is primarily an open source web analytics platform
Snowplow data pipeline: Website / webapp → Collect → Transform and enrich → Amazon S3 → Amazon Redshift / PostgreSQL
• Your granular, event-level and customer-level data, in your own data warehouse
• Connect any analytics tool to your data
• Join your web analytics data with any other data set
5. Snowplow was born out of our frustration with traditional web analytics tools…
• Limited set of reports that don’t answer business questions
  • Traffic levels by source
  • Conversion levels
  • Bounce rates
  • Pages / visit
• Web analytics tools don’t understand the entities that matter to business
  • Customers, intentions, behaviours, articles, videos, authors, subjects, services…
  • …vs pages, conversions, goals, clicks, transactions
• Web analytics tools are siloed
  • Hard to integrate with other data sets incl. digital (marketing spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)
6. …and out of the opportunities to tame new “big data” technologies
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
7. Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable
1. Trackers – generate event data. Examples: JavaScript tracker; Python / Lua / No-JS / Arduino trackers
2. Collectors – receive data from trackers and log it to S3. Examples: CloudFront collector; Clojure collector for Amazon EB
3. Enrich – clean and enrich raw data. Built on Scalding / Cascading / Hadoop and powered by Amazon EMR
4. Storage – store data ready for analysis. Examples: Amazon Redshift; PostgreSQL; Amazon S3
5. Analytics
• The interfaces between the subsystems (A–D) are standardised data protocols
• Batch-based – normally run overnight; sometimes every 4-6 hours
9. Our initial skunkworks version of Snowplow had no Scala
Snowplow data pipeline v1: Website / webapp → JavaScript event tracker → CloudFront-based pixel collector → HiveQL + Java UDF “ETL” → Amazon S3
10. But our schema-first, loosely coupled approach made it possible to start swapping out existing components…
Snowplow data pipeline v2: Website / webapp → JavaScript event tracker → CloudFront-based event collector or Clojure-based event collector → Scalding-based enrichment or HiveQL + Java UDF “ETL” → Amazon S3 → Amazon Redshift / PostgreSQL
11. What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:
• Scalding, Cascalog, cascading.jruby and PyCascading are all DSLs/APIs on top of Cascading
• Cascading sits alongside Hive, Pig and raw Java MapReduce on top of Hadoop MapReduce and the Hadoop DFS
12. We chose Cascading because we liked its “plumbing” abstraction over vanilla MapReduce
13. Why did we choose Scalding instead of one of the other
Cascading DSLs/APIs?
• Lots of internal experience with Scala – could hit the
ground running (only very basic awareness of Clojure
when we started the project)
• Scalding created and supported by Twitter, who use it
throughout their organization – so we knew it was a
safe long-term bet
• More controversial opinion (although maybe not at a
Scala UG): we believe that data pipelines should be as
strongly typed as possible – all the other DSLs/APIs on
top of Cascading encourage dynamic typing
14. Strongly typed data pipelines – why?
• Catch errors as soon as possible – and report them in a strongly typed way too
• Define the inputs and outputs of each of your data processing steps in an
unambiguous way
• Forces you to formally address the data types flowing through your system
• Lets you write code like this:
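The code on the original slide is not reproduced in this transcript; the following is a minimal sketch of what a strongly typed Scalding pipeline looks like, with invented field names and paths (this is not the actual Snowplow code):

import com.twitter.scalding._

// Hypothetical record type – the compiler now knows exactly what flows
// through each step of the pipeline.
case class PageView(userId: String, path: String, durationMs: Int)

class PageViewJob(args: Args) extends Job(args) {

  // Read a TSV whose columns are declared up front; a schema mismatch is an
  // early, typed error rather than a silent corruption mid-job.
  TypedPipe.from(TypedTsv[(String, String, Int)](args("input")))
    .map { case (userId, path, durationMs) => PageView(userId, path, durationMs) }
    .groupBy(_.userId)
    .mapValues(_.durationMs)
    .sum                     // total time on site per user
    .toTypedPipe
    .write(TypedTsv[(String, Int)](args("output")))
}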
16. The secret sauce for data processing in Scala: the Scalaz Validation (1/3)
• Our basic processing model for Snowplow looks like this: raw events go through the Scalding enrichment process, which emits “good” enriched events plus “bad” raw events together with the reasons why they are bad
• This fits incredibly well onto the Validation applicative functor from the Scalaz project
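As a minimal sketch of the shape of that model (the event types and parsing rule here are invented for illustration): each raw event either becomes a Success carrying the enriched event, or a Failure carrying the reason it was rejected.

import scalaz._

case class RawEvent(payload: String)
case class EnrichedEvent(userId: String, page: String)

// Enrich one raw event: Success(enriched) or Failure(reason why it is bad)
def enrich(raw: RawEvent): Validation[String, EnrichedEvent] =
  raw.payload.split(",") match {
    case Array(user, page) => Success(EnrichedEvent(user, page))
    case _                 => Failure(s"Unparseable payload: ${raw.payload}")
  }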
17. The secret sauce for data processing in Scala: the Scalaz
Validation (2/3)
• We were able to express our data flow in terms of some relatively simple types:
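The types on the slide are not reproduced here; roughly speaking (with illustrative names, not the exact Snowplow definitions) they look like this:

import scalaz._

case class EnrichedEvent(userId: String, page: String)

// A processed event is either the enriched output, or a non-empty list of
// error strings explaining why the raw line was rejected. Some steps may
// legitimately produce no event at all, hence the Option variant.
object EnrichmentTypes {
  type Errors              = NonEmptyList[String]
  type ValidatedEvent      = Validation[Errors, EnrichedEvent]
  type ValidatedMaybeEvent = Validation[Errors, Option[EnrichedEvent]]
}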
18. The secret sauce for data processing in Scala: the Scalaz
Validation (3/3)
• Scalaz Validation lets us do a variety of different validations and enrichments,
and then collate the failures
• This is really powerful!
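A minimal sketch of the collation, with invented field checks: each field is validated independently, and the applicative builder (|@|) combines them so that every failure is collected into the NonEmptyList rather than stopping at the first error.

import scalaz._
import Scalaz._

case class Event(userId: String, timestamp: Long, ipAddress: String)

def nonEmpty(name: String, v: String): ValidationNel[String, String] =
  if (v.nonEmpty) Success(v) else Failure(NonEmptyList(s"$name is empty"))

def validTimestamp(raw: String): ValidationNel[String, Long] =
  try Success(raw.toLong)
  catch { case _: NumberFormatException => Failure(NonEmptyList(s"'$raw' is not a valid timestamp")) }

// All three checks run; a bad userId AND a bad timestamp both appear in the result
def validate(user: String, ts: String, ip: String): ValidationNel[String, Event] =
  (nonEmpty("userId", user) |@| validTimestamp(ts) |@| nonEmpty("ipAddress", ip))(Event.apply _)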
19. On the testing side: we love Specs2 data tables…
• They let us test a variety of inputs and expected outputs without making the
mistake of just duplicating the data processing functionality in the test:
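A minimal sketch of the style, with an invented function under test and made-up rows (not Snowplow’s real tables): the table states inputs and expected outputs side by side, and the closure at the end is run against every row.

import org.specs2.mutable.Specification
import org.specs2.matcher.DataTables

// Hypothetical function under test
object PlatformEnrichment {
  def extractPlatform(raw: String): String = raw match {
    case "web" => "Web"
    case "mob" => "Mobile"
    case other => "Unknown: " + other
  }
}

class ExtractPlatformSpec extends Specification with DataTables {

  "extractPlatform" should {
    "map raw platform codes to canonical platform names" in {
      "raw code"  || "expected platform"  |
      "web"       !! "Web"                |
      "mob"       !! "Mobile"             |
      "tv"        !! "Unknown: tv"        |> { (raw, expected) =>
        PlatformEnrichment.extractPlatform(raw) must_== expected
      }
    }
  }
}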
20. … and are starting to do more with ScalaCheck
• ScalaCheck is a property-based testing framework, originally inspired by
Haskell’s QuickCheck
• We use it in a few places – including to generate unpredictable bad data and also to validate our new Thrift schema for raw Snowplow events:
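A minimal sketch of the idea, with an invented parser and generator (not the actual Snowplow properties): ScalaCheck generates hundreds of arbitrary, often malformed querystrings and asserts that the parser always returns a value rather than throwing.

import org.scalacheck.{Gen, Properties}
import org.scalacheck.Prop.forAll

// Hypothetical parser under test: either a parsed key/value map or an error
object Querystring {
  def parse(qs: String): Either[String, Map[String, String]] =
    try {
      Right(qs.split("&").toList.flatMap { kv =>
        kv.split("=", 2) match {
          case Array(k, v) if k.nonEmpty => Some(k -> v)
          case _                         => None
        }
      }.toMap)
    } catch { case e: Exception => Left(e.getMessage) }
}

object QuerystringProps extends Properties("Querystring.parse") {

  // Generator for deliberately messy input: alphanumerics mixed with
  // delimiters and percent signs in random order
  val messyQuerystring: Gen[String] =
    Gen.listOf(Gen.oneOf(Gen.alphaNumChar, Gen.oneOf('&', '=', '%', ' ')))
       .map(_.mkString)

  property("never throws on malformed input") = forAll(messyQuerystring) { qs =>
    val result = Querystring.parse(qs)
    result.isLeft || result.isRight   // i.e. it returned normally
  }
}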
21. Build and deployment: we have learnt to love (or at least
peacefully co-exist with) SBT
• .scala based SBT build, not .sbt
• We use the sbt-assembly plugin (sbt assembly) to create a fat jar for our Scalding ETL process – with some
custom exclusions to play nicely on Amazon Elastic MapReduce
• Deployment is incredibly easy compared to the pain we have had with our two
Ruby instrumentation apps (EmrEtlRunner and StorageLoader)
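For illustration, a pared-down .scala build in the style described, using the sbt-assembly plugin; versions, the dependency list and the exclusion rule are simplified placeholders, not the exact Snowplow build:

// project/SnowplowBuild.scala
import sbt._
import Keys._
import sbtassembly.Plugin._
import AssemblyKeys._

object SnowplowBuild extends Build {

  lazy val hadoopEnrich = Project("snowplow-hadoop-enrich", file("."))
    .settings(
      organization := "com.snowplowanalytics",
      scalaVersion := "2.10.4",
      libraryDependencies ++= Seq(
        "com.twitter" %% "scalding-core" % "0.10.0",
        "org.scalaz"  %% "scalaz-core"   % "7.0.6"
      )
    )
    .settings(assemblySettings: _*)
    .settings(
      // Hadoop is provided by Amazon EMR at runtime, so keep it out of the fat jar
      excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
        cp filter { _.data.getName.startsWith("hadoop-") }
      }
    )
}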
23. We try to make our validation and enrichment process as
modular as possible
• This encourages testability and re-use – also it widens the number of
contributors vs this functionality being embedded in Snowplow
• The Enrichment Manager uses external libraries (hosted in a Snowplow
repository) which can be used in non-Snowplow projects:
[Diagram: the Enrichment Manager calling out to these external libraries; some are not yet integrated]
24. We also have a few standalone Scala projects which might be of
interest
• None of these projects assume that you are running Snowplow:
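For example, scala-maxmind-geoip can be used on its own to look up the geographic location of an IP address. The snippet below is from memory of the project README at the time, so treat the exact signatures as approximate and check the repository:

import com.snowplowanalytics.maxmind.geoip.IpGeo

// dbFile points at a local copy of MaxMind's GeoLiteCity.dat (path is illustrative)
val ipGeo = IpGeo(dbFile = "/opt/maxmind/GeoLiteCity.dat", memCache = false, lruCache = 20000)

// getLocation returns an Option – None if the IP cannot be located
for (loc <- ipGeo.getLocation("213.52.50.8")) {
  println(loc.countryCode)
  println(loc.countryName)
}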
26. We want to move Snowplow to a unified log-based architecture
[Architecture diagram: within the cloud vendor / own data centre, a unified log sits at the centre; narrow data siloes (CMS, e-comm, ERP, CRM, search) and SaaS vendors feed it via APIs, streaming APIs / web hooks and event streams; Hadoop archives the log, giving high latency but wide data coverage and full data history; low-latency local loops drive email marketing, ad hoc analytics, product rec’s, systems monitoring, management reporting, fraud detection and churn prevention]
27. Again, our schema-first approach is letting us get to this architecture through a set of baby steps (1/2)
• In 0.8.12 at the start of the year we performed some surgery to de-couple our core enrichment code from its Scalding harness: pre-0.8.12 everything lived in hadoop-etl; from 0.8.12 the record-level enrichment functionality lives in scala-common-enrich, which is shared by scala-hadoop-enrich and scala-kinesis-enrich
28. Then in 0.9.0 we released our first new Scala components leveraging Amazon Kinesis:
• Snowplow Trackers send events to the Scala Stream Collector, which writes them to a raw event stream; an S3 sink Kinesis app archives the raw stream to S3, and an Enrich Kinesis app reads it to produce an enriched event stream plus a bad raw events stream; a Redshift sink Kinesis app loads the enriched stream into Redshift
• The parts in grey are still under development – we are working with Snowplow community members on these collaboratively
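As a minimal sketch of the Kinesis side (this is the AWS Java SDK called from Scala, not the actual Scala Stream Collector code; the stream name and partitioning scheme are illustrative):

import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.PutRecordRequest

object RawEventSink {

  // Credentials and region come from the default AWS provider chain
  private val kinesis = new AmazonKinesisClient()

  // Write one serialized raw event to the raw event stream; the partition
  // key determines which shard the record lands on
  def put(streamName: String, payload: Array[Byte], partitionKey: String): Unit = {
    val request = new PutRecordRequest()
      .withStreamName(streamName)
      .withData(ByteBuffer.wrap(payload))
      .withPartitionKey(partitionKey)
    kinesis.putRecord(request)
  }
}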
31. Separately, we want to re-architect our data processing pipeline
to make it even more schema’ed! (1/3)
• Our current approach involves a “Tracker Protocol” which is defined in a wiki
page, processed in the Enrichment Manager and then written out to TSV files for
loading into Redshift and Postgres (see over)
32. [Image-only slide: the existing Tracker Protocol flow referred to above (2/3)]
33. Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (3/3)
• We are planning to replace the existing flow with a JSON Schema-driven approach, where a JSON Schema defining events:
1. Defines the structure of the raw events in JSON format
2. Is used to validate those events in the Enrichment Manager
3. Defines the structure of the enriched events in Thrift or Avro format
4. Drives the shredding performed by the Shredder
5. Defines the structure of the enriched events in TSV, ready for loading into the db
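To make step 2 concrete, here is a rough sketch of validating a raw JSON event against a JSON Schema and lifting the result into the Scalaz Validation used elsewhere in the pipeline. It uses the json-schema-validator library purely as an illustration; this is a plan rather than shipped Snowplow code, and the object and method names below are invented:

import com.fasterxml.jackson.databind.JsonNode
import com.github.fge.jackson.JsonLoader
import com.github.fge.jsonschema.main.JsonSchemaFactory
import scala.collection.JavaConverters._
import scalaz._

object SchemaValidation {

  private val factory = JsonSchemaFactory.byDefault()

  // Success: the parsed event, known to conform to the schema.
  // Failure: every validation message, collated into a NonEmptyList.
  def validateEvent(schemaJson: String, eventJson: String): Validation[NonEmptyList[String], JsonNode] = {
    val schema = factory.getJsonSchema(JsonLoader.fromString(schemaJson))
    val event  = JsonLoader.fromString(eventJson)
    val report = schema.validate(event)

    if (report.isSuccess) Success(event)
    else report.iterator.asScala.map(_.getMessage).toList match {
      case head :: tail => Failure(NonEmptyList(head, tail: _*))
      case Nil          => Failure(NonEmptyList("JSON Schema validation failed with no messages"))
    }
  }
}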