I talked to the London Scala Users' Group about building Snowplow, an open source event analytics platform, on top of Scala and key libraries and frameworks including Scalding, Scalaz and Spray. The talk highlights some of the data processing tricks and techniques we picked up along the way, particularly: schema-first development; monadic ETL; datatable-based testing; data transformation maps. It also introduces some of the Scala libraries the Snowplow team has open sourced along the way (such as scala-forex, referer-parser and scala-maxmind-geoip).
2. Building data processing apps in Scala
1. Snowplow – what is it?
2. Snowplow and Scala
3. Deep dive into our Scala code
4. Modularization and non-Snowplow code you can use
5. Roadmap
6. Questions
7. Appendix: even more roadmap
4. Today, Snowplow is primarily an open source web analytics platform
Snowplow data pipeline: Website / webapp → Collect → Transform and enrich → Amazon S3 → Amazon Redshift / PostgreSQL
• Your granular, event-level and customer-level data, in your own data warehouse
• Connect any analytics tool to your data
• Join your web analytics data with any other data set
5. Snowplow was born out of our frustration with traditional web analytics tools…
• Limited set of reports that don’t answer business questions
  • Traffic levels by source
  • Conversion levels
  • Bounce rates
  • Pages / visit
• Web analytics tools don’t understand the entities that matter to business
  • Customers, intentions, behaviours, articles, videos, authors, subjects, services…
  • …vs pages, conversions, goals, clicks, transactions
• Web analytics tools are siloed
  • Hard to integrate with other data sets incl. digital (marketing spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)
6. …and out of the opportunities to tame new “big data” technologies
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
7. Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable
1. Trackers – generate event data. Examples: JavaScript tracker; Python / Lua / No-JS / Arduino trackers
2. Collectors – receive data from trackers and log it to S3. Examples: CloudFront collector; Clojure collector for Amazon EB
3. Enrich – clean and enrich raw data. Built on Scalding / Cascading / Hadoop and powered by Amazon EMR
4. Storage – store data ready for analysis. Examples: Amazon Redshift; PostgreSQL; Amazon S3
5. Analytics
• The interfaces between the subsystems (A–D) are standardised data protocols
• Batch-based – normally run overnight; sometimes every 4-6 hours
9. Our initial skunkworks version of Snowplow had no Scala
Snowplow data pipeline v1: Website / webapp → JavaScript event tracker → CloudFront-based pixel collector → HiveQL + Java UDF “ETL” → Amazon S3
10. But our schema-first, loosely coupled approach made it possible to start swapping out existing components…
Snowplow data pipeline v2: Website / webapp → JavaScript event tracker → CloudFront-based event collector or Clojure-based event collector → Scalding-based enrichment or HiveQL + Java UDF “ETL” → Amazon S3 → Amazon Redshift / PostgreSQL
11. What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:
• Scalding, Cascalog, cascading.jruby and PyCascading are all DSLs/APIs on top of Cascading
• Cascading sits alongside Hive, Pig and raw Java MapReduce on top of Hadoop MapReduce and the Hadoop DFS
12. We chose Cascading because we liked its “plumbing” abstraction over vanilla MapReduce
13. Why did we choose Scalding instead of one of the other
Cascading DSLs/APIs?
• Lots of internal experience with Scala – could hit the
ground running (only very basic awareness of Clojure
when we started the project)
• Scalding created and supported by Twitter, who use it
throughout their organization – so we knew it was a
safe long-term bet
• More controversial opinion (although maybe not at a
Scala UG): we believe that data pipelines should be as
strongly typed as possible – all the other DSLs/APIs on
top of Cascading encourage dynamic typing
14. Strongly typed data pipelines – why?
• Catch errors as soon as possible – and report them in a strongly typed way too
• Define the inputs and outputs of each of your data processing steps in an
unambiguous way
• Forces you to formally address the data types flowing through your system
• Lets you write code like this:
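The code on the original slide is not reproduced in this transcript; the following is a minimal sketch of what a strongly typed Scalding pipeline looks like, with invented field names and paths (this is not the actual Snowplow code):

import com.twitter.scalding._

// Hypothetical record type – the compiler now knows exactly what flows
// through each step of the pipeline.
case class PageView(userId: String, path: String, durationMs: Int)

class PageViewJob(args: Args) extends Job(args) {

  // Read a TSV whose columns are declared up front; a schema mismatch is an
  // early, typed error rather than a silent corruption mid-job.
  TypedPipe.from(TypedTsv[(String, String, Int)](args("input")))
    .map { case (userId, path, durationMs) => PageView(userId, path, durationMs) }
    .groupBy(_.userId)
    .mapValues(_.durationMs)
    .sum                     // total time on site per user
    .toTypedPipe
    .write(TypedTsv[(String, Int)](args("output")))
}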
16. The secret sauce for data processing in Scala: the Scalaz Validation (1/3)
• Our basic processing model for Snowplow looks like this: raw events go through the Scalding enrichment process, which emits “good” enriched events plus “bad” raw events together with the reasons why they are bad
• This fits incredibly well onto the Validation applicative functor from the Scalaz project
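As a minimal sketch of the shape of that model (the event types and parsing rule here are invented for illustration): each raw event either becomes a Success carrying the enriched event, or a Failure carrying the reason it was rejected.

import scalaz._

case class RawEvent(payload: String)
case class EnrichedEvent(userId: String, page: String)

// Enrich one raw event: Success(enriched) or Failure(reason why it is bad)
def enrich(raw: RawEvent): Validation[String, EnrichedEvent] =
  raw.payload.split(",") match {
    case Array(user, page) => Success(EnrichedEvent(user, page))
    case _                 => Failure(s"Unparseable payload: ${raw.payload}")
  }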
17. The secret sauce for data processing in Scala: the Scalaz
Validation (2/3)
• We were able to express our data flow in terms of some relatively simple types:
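The types on the slide are not reproduced here; roughly speaking (with illustrative names, not the exact Snowplow definitions) they look like this:

import scalaz._

case class EnrichedEvent(userId: String, page: String)

// A processed event is either the enriched output, or a non-empty list of
// error strings explaining why the raw line was rejected. Some steps may
// legitimately produce no event at all, hence the Option variant.
object EnrichmentTypes {
  type Errors              = NonEmptyList[String]
  type ValidatedEvent      = Validation[Errors, EnrichedEvent]
  type ValidatedMaybeEvent = Validation[Errors, Option[EnrichedEvent]]
}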
18. The secret sauce for data processing in Scala: the Scalaz
Validation (3/3)
• Scalaz Validation lets us do a variety of different validations and enrichments,
and then collate the failures
• This is really powerful!
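A minimal sketch of the collation, with invented field checks: each field is validated independently, and the applicative builder (|@|) combines them so that every failure is collected into the NonEmptyList rather than stopping at the first error.

import scalaz._
import Scalaz._

case class Event(userId: String, timestamp: Long, ipAddress: String)

def nonEmpty(name: String, v: String): ValidationNel[String, String] =
  if (v.nonEmpty) Success(v) else Failure(NonEmptyList(s"$name is empty"))

def validTimestamp(raw: String): ValidationNel[String, Long] =
  try Success(raw.toLong)
  catch { case _: NumberFormatException => Failure(NonEmptyList(s"'$raw' is not a valid timestamp")) }

// All three checks run; a bad userId AND a bad timestamp both appear in the result
def validate(user: String, ts: String, ip: String): ValidationNel[String, Event] =
  (nonEmpty("userId", user) |@| validTimestamp(ts) |@| nonEmpty("ipAddress", ip))(Event.apply _)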
19. On the testing side: we love Specs2 data tables…
• They let us test a variety of inputs and expected outputs without making the
mistake of just duplicating the data processing functionality in the test:
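A minimal sketch of the style, with an invented function under test and made-up rows (not Snowplow’s real tables): the table states inputs and expected outputs side by side, and the closure at the end is run against every row.

import org.specs2.mutable.Specification
import org.specs2.matcher.DataTables

// Hypothetical function under test
object PlatformEnrichment {
  def extractPlatform(raw: String): String = raw match {
    case "web" => "Web"
    case "mob" => "Mobile"
    case other => "Unknown: " + other
  }
}

class ExtractPlatformSpec extends Specification with DataTables {

  "extractPlatform" should {
    "map raw platform codes to canonical platform names" in {
      "raw code"  || "expected platform"  |
      "web"       !! "Web"                |
      "mob"       !! "Mobile"             |
      "tv"        !! "Unknown: tv"        |> { (raw, expected) =>
        PlatformEnrichment.extractPlatform(raw) must_== expected
      }
    }
  }
}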
20. … and are starting to do more with ScalaCheck
• ScalaCheck is a property-based testing framework, originally inspired by
Haskell’s QuickCheck
• We use it in a few places – including to generate unpredictable bad data and also to validate our new Thrift schema for raw Snowplow events:
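A minimal sketch of the idea, with an invented parser and generator (not the actual Snowplow properties): ScalaCheck generates hundreds of arbitrary, often malformed querystrings and asserts that the parser always returns a value rather than throwing.

import org.scalacheck.{Gen, Properties}
import org.scalacheck.Prop.forAll

// Hypothetical parser under test: either a parsed key/value map or an error
object Querystring {
  def parse(qs: String): Either[String, Map[String, String]] =
    try {
      Right(qs.split("&").toList.flatMap { kv =>
        kv.split("=", 2) match {
          case Array(k, v) if k.nonEmpty => Some(k -> v)
          case _                         => None
        }
      }.toMap)
    } catch { case e: Exception => Left(e.getMessage) }
}

object QuerystringProps extends Properties("Querystring.parse") {

  // Generator for deliberately messy input: alphanumerics mixed with
  // delimiters and percent signs in random order
  val messyQuerystring: Gen[String] =
    Gen.listOf(Gen.oneOf(Gen.alphaNumChar, Gen.oneOf('&', '=', '%', ' ')))
       .map(_.mkString)

  property("never throws on malformed input") = forAll(messyQuerystring) { qs =>
    val result = Querystring.parse(qs)
    result.isLeft || result.isRight   // i.e. it returned normally
  }
}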
21. Build and deployment: we have learnt to love (or at least
peacefully co-exist with) SBT
• .scala based SBT build, not .sbt
• We use the sbt-assembly plugin (sbt assembly) to create a fat jar for our Scalding ETL process – with some
custom exclusions to play nicely on Amazon Elastic MapReduce
• Deployment is incredibly easy compared to the pain we have had with our two
Ruby instrumentation apps (EmrEtlRunner and StorageLoader)
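For illustration, a pared-down .scala build in the style described, using the sbt-assembly plugin; versions, the dependency list and the exclusion rule are simplified placeholders, not the exact Snowplow build:

// project/SnowplowBuild.scala
import sbt._
import Keys._
import sbtassembly.Plugin._
import AssemblyKeys._

object SnowplowBuild extends Build {

  lazy val hadoopEnrich = Project("snowplow-hadoop-enrich", file("."))
    .settings(
      organization := "com.snowplowanalytics",
      scalaVersion := "2.10.4",
      libraryDependencies ++= Seq(
        "com.twitter" %% "scalding-core" % "0.10.0",
        "org.scalaz"  %% "scalaz-core"   % "7.0.6"
      )
    )
    .settings(assemblySettings: _*)
    .settings(
      // Hadoop is provided by Amazon EMR at runtime, so keep it out of the fat jar
      excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
        cp filter { _.data.getName.startsWith("hadoop-") }
      }
    )
}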
23. We try to make our validation and enrichment process as
modular as possible
• This encourages testability and re-use – also it widens the number of
contributors vs this functionality being embedded in Snowplow
• The Enrichment Manager uses external libraries (hosted in a Snowplow
repository) which can be used in non-Snowplow projects:
[Diagram: the Enrichment Manager calling out to these external libraries; some are not yet integrated]
24. We also have a few standalone Scala projects which might be of
interest
• None of these projects assume that you are running Snowplow:
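For example, scala-maxmind-geoip can be used on its own to look up the geographic location of an IP address. The snippet below is from memory of the project README at the time, so treat the exact signatures as approximate and check the repository:

import com.snowplowanalytics.maxmind.geoip.IpGeo

// dbFile points at a local copy of MaxMind's GeoLiteCity.dat (path is illustrative)
val ipGeo = IpGeo(dbFile = "/opt/maxmind/GeoLiteCity.dat", memCache = false, lruCache = 20000)

// getLocation returns an Option – None if the IP cannot be located
for (loc <- ipGeo.getLocation("213.52.50.8")) {
  println(loc.countryCode)
  println(loc.countryName)
}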
26. We want to move Snowplow to a unified log-based architecture
[Architecture diagram: within the cloud vendor / own data centre, a unified log sits at the centre; narrow data siloes (CMS, e-comm, ERP, CRM, search) and SaaS vendors feed it via APIs, streaming APIs / web hooks and event streams; Hadoop archives the log, giving high latency but wide data coverage and full data history; low-latency local loops drive email marketing, ad hoc analytics, product rec’s, systems monitoring, management reporting, fraud detection and churn prevention]
27. Again, our schema-first approach is letting us get to this architecture through a set of baby steps (1/2)
• In 0.8.12 at the start of the year we performed some surgery to de-couple our core enrichment code from its Scalding harness: pre-0.8.12 everything lived in hadoop-etl; from 0.8.12 the record-level enrichment functionality lives in scala-common-enrich, which is shared by scala-hadoop-enrich and scala-kinesis-enrich
28. Then in 0.9.0 we released our first new Scala components leveraging Amazon Kinesis:
• Snowplow Trackers send events to the Scala Stream Collector, which writes them to a raw event stream; an S3 sink Kinesis app archives the raw stream to S3, and an Enrich Kinesis app reads it to produce an enriched event stream plus a bad raw events stream; a Redshift sink Kinesis app loads the enriched stream into Redshift
• The parts in grey are still under development – we are working with Snowplow community members on these collaboratively
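As a minimal sketch of the Kinesis side (this is the AWS Java SDK called from Scala, not the actual Scala Stream Collector code; the stream name and partitioning scheme are illustrative):

import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.PutRecordRequest

object RawEventSink {

  // Credentials and region come from the default AWS provider chain
  private val kinesis = new AmazonKinesisClient()

  // Write one serialized raw event to the raw event stream; the partition
  // key determines which shard the record lands on
  def put(streamName: String, payload: Array[Byte], partitionKey: String): Unit = {
    val request = new PutRecordRequest()
      .withStreamName(streamName)
      .withData(ByteBuffer.wrap(payload))
      .withPartitionKey(partitionKey)
    kinesis.putRecord(request)
  }
}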
31. Separately, we want to re-architect our data processing pipeline
to make it even more schema’ed! (1/3)
• Our current approach involves a “Tracker Protocol” which is defined in a wiki
page, processed in the Enrichment Manager and then written out to TSV files for
loading into Redshift and Postgres (see over)
32. [Image-only slide: the existing Tracker Protocol flow referred to above (2/3)]
33. Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (3/3)
• We are planning to replace the existing flow with a JSON Schema-driven approach, where a JSON Schema defining events:
1. Defines the structure of the raw events in JSON format
2. Is used to validate those events in the Enrichment Manager
3. Defines the structure of the enriched events in Thrift or Avro format
4. Drives the shredding performed by the Shredder
5. Defines the structure of the enriched events in TSV, ready for loading into the db
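To make step 2 concrete, here is a rough sketch of validating a raw JSON event against a JSON Schema and lifting the result into the Scalaz Validation used elsewhere in the pipeline. It uses the json-schema-validator library purely as an illustration; this is a plan rather than shipped Snowplow code, and the object and method names below are invented:

import com.fasterxml.jackson.databind.JsonNode
import com.github.fge.jackson.JsonLoader
import com.github.fge.jsonschema.main.JsonSchemaFactory
import scala.collection.JavaConverters._
import scalaz._

object SchemaValidation {

  private val factory = JsonSchemaFactory.byDefault()

  // Success: the parsed event, known to conform to the schema.
  // Failure: every validation message, collated into a NonEmptyList.
  def validateEvent(schemaJson: String, eventJson: String): Validation[NonEmptyList[String], JsonNode] = {
    val schema = factory.getJsonSchema(JsonLoader.fromString(schemaJson))
    val event  = JsonLoader.fromString(eventJson)
    val report = schema.validate(event)

    if (report.isSuccess) Success(event)
    else report.iterator.asScala.map(_.getMessage).toList match {
      case head :: tail => Failure(NonEmptyList(head, tail: _*))
      case Nil          => Failure(NonEmptyList("JSON Schema validation failed with no messages"))
    }
  }
}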