Schibsted collects and analyzes 900 million events/day using AWS. This presentation gives an overview of the systems and architecture, including the solutions for GDPR compliance.
5. Data Platform team
Jordi Roura
Ole-Magnus Røysted Aker
Sangram Bal
Oleksandr Ivanov
Mårten Rånge
Håvard Wall
Bjørn Rustad
Rolv Seehus
Håkon Åmdal
Fredrik Vraalsen
Øyvind Løkling
Per Wessel Nore
Lars Marius Garshol
Ning Zhou
9. The Batch Job
• Implemented in Apache Spark
• Luigi for scheduling/orchestration
• Runs in a shared Mesos cluster
• this was set up because letting users create individual
clusters became far too expensive
• this cluster is the main cost driver for Schibsted’s AWS bills
• Difficult environment to work with
• hard to debug and develop in
10. Problems with batch
• Configuration file mapped to Spark tasks
• very complex set of Spark tasks
• requires lots of communication between Spark nodes
• runs slowly
• Very resource-intensive
• had difficulty keeping up with incoming traffic
• very sensitive to “cluster weather”
• brittle
11. Piper & Storage
• Ordinary Java and Scala applications
• read from the data source, perform all processing on one
node, then write to the destination
• no communication necessary between nodes
• normal EC2 nodes with the application baked into the AMI
• therefore scales trivially with Auto Scaling Groups
• Instrumented with Datadog, logs loaded into
SumoLogic
13. Kafka vs Kinesis
• Kinesis has very strict API limits
• total read limit = 2x write limit
• effectively limited to 2 readers
• Kinesis API is very limited
• basically only supports reading records in order
• Kafka improves on both
• can support many readers simultaneously
• advanced Scala DSL with window functions, etc.
15. Handling slow consumers
[Diagram: Yggdrasil reads the Firehose topic from Kafka and, applying transforms and filtering, fans events out to one Kafka topic per consumer; Duratro likewise reads one topic per consumer. A slow consumer only lags behind on its own topic, while the other consumers stay OK.]
16. Pulse challenges
• Pulse is a tracking solution with no user interface
• you want dashboards to analyze user traffic? sorry
• problem is: not enough resources to develop that
• Using Amplitude to solve that
• created a Duratro sink for Amplitude
• simple HTTP POST of JSON events to Amplitude API
• users can now create Amplitude projects, feed their Pulse
events there, and finally have dashboards
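The Duratro sink described above amounts to shaping each Pulse event into Amplitude's event format and POSTing a JSON batch. A minimal Python sketch, assuming a simplified Pulse event shape and Amplitude's public HTTP V2 endpoint; the field mapping here is hypothetical, not Schibsted's actual sink:

```python
import json
import urllib.request

AMPLITUDE_URL = "https://api2.amplitude.com/2/httpapi"  # Amplitude HTTP V2 endpoint

def pulse_to_amplitude(pulse_event: dict) -> dict:
    """Map a (simplified, hypothetical) Pulse event to Amplitude's event shape."""
    return {
        "user_id": pulse_event.get("actor", {}).get("spt:userId", "anonymous"),
        "event_type": pulse_event.get("@type", "Unknown"),
        "event_properties": {"object_type": pulse_event.get("object", {}).get("@type")},
    }

def post_events(api_key: str, pulse_events: list) -> bytes:
    """Simple HTTP POST of JSON events to the Amplitude API (network sketch)."""
    body = json.dumps({"api_key": api_key,
                       "events": [pulse_to_amplitude(e) for e in pulse_events]}).encode()
    req = urllib.request.Request(AMPLITUDE_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```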
17. Transforms
• Because of GDPR, we need to anonymize most incoming data formats
• Some data has quality issues that cannot be fixed at the source; these require transforms to solve
• In many cases data needs to be transformed from one
format to another
• Pulse to Amplitude
• ClickMeter to Pulse
• Convert data to match database structures
• …
18. Who configures?
• Schibsted has >100 business units
• for Data Platform to do detailed configuration for all of
these isn’t going to scale
• for sites to do it themselves saves lots of time
• Transforms require domain knowledge
• each site has its own specialities in Pulse tracking
• to transform these correctly requires knowing all this
21. Duratro: config
pipes {
ATEDev {
sourceTopic = "Public-DataPro-Yggdrasil-ATE-Dev-AteBehaviorEvent-1"
sink {
type = "kinesis"
stream = "AUTO-ate-online-events-loader-AteOnlineEventDataStream-3WEL7DDN2KQG"
region = "eu-west-1"
role = "arn:aws:iam::972724508451:role/AUTO-ate-online-events-lo-AteOnlineEventsDataWrite-GTMDBZSEZJF0"
session = "kinesis-ate-dev"
}
}
}
22. Not in great shape
• Transforms were written in Scala code
• not easy to read even for Scala developers
• most site devs are not Scala developers
• Config changes require deploys
• in streaming, matching changes must be made to both
Yggdrasil and Duratro
• Three different configuration syntaxes
• definition of same type of event different in batch &
streaming
23. What if?
• We had an expression language for JSON, kind of
like jq
• could write routing filters using that
• We had a transformation language for JSON
• write as JSON template, using expression language to
compute values to insert
• A custom routing language for both batch and
streaming, based on this language
• designed for easy expressivity & deploy
24. JSLT
• Custom language for JSON transforms & queries
• First iteration
• JSON syntax with ${ … } wrappers for jq expressions
• very simple additions: let, for and if expressions
• tried out, worked well, but not ideal
• Second iteration
• own language from the ground up
• far better performance
• easier to write and use
25. JSLT expressions
.foo Get "foo" key from input object
.foo.bar As above + .bar inside that
.foo == 231 Comparison
.foo and .bar < 12 Boolean operator
test(.foo, "^[a-z0-9]+$") Functions (& regexps)
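The dot-path access at the core of these expressions can be sketched in a few lines of Python — a toy illustration of the idea, not the real JSLT evaluator:

```python
def get_path(value, path):
    """Evaluate a jq/JSLT-style dot path like '.foo.bar' against parsed JSON.
    Missing keys or non-object values yield None, mirroring JSLT's null semantics."""
    for key in path.lstrip(".").split("."):
        if isinstance(value, dict):
            value = value.get(key)
        else:
            return None
    return value
```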
27. More features
• [for (.array) number(.) * 1.1]
• convert each element in an array
• * : .
• object matcher, keeps rest of object unchanged
• {for (.object) "prefix" + .key : .value}
• dynamic rewrite of object
• def func(p1, p2)
• define custom functions
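A hypothetical JSLT fragment combining these features — a for-loop over an array, a custom function, and the object matcher; `with-vat` and the input shape are invented for illustration:

```
// compute new prices, keep everything else in the object unchanged
def with-vat(price)
  number($price) * 1.25

{
  "prices" : [for (.prices) with-vat(.)],
  * : .
}
```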
28. Benefits of JSLT
• Easier to read and write than code
• Doesn’t require user to know Scala
• Can be embedded in configuration
• Flexible enough to support 99-100% of filters/transforms
• Performance quite good
• 5-10x faster than the first iteration, which was based on jackson-jq
30. Routing language
Firehose:
description: All incoming events.
transform: transforms/base-cleanup.jslt
PulseBase:
description: Cleaned-up Pulse events with all the information in them.
baseType: Firehose
filter: import "filters/pulse.jslt" as pulse pulse(.)
transform: transforms/pulse-cleanup.jslt
postFilter: import "filters/pulse-valid.jslt" as valid valid(.)
31. Pulse definitions
PulseIdentified:
description: Pulse events with personally identifying information included.
baseType: PulseBase
filter: .actor."spt:userId"
transform: transforms/pulse-identified.jslt
PulseAnonymized:
description: Pulse events with personally identifying information excluded.
baseType: PulseBase
transform: transforms/pulse-anonymized.jslt
32. pulse-identified.jslt
let isFiltered = (contains(get-client(.), $filteredProviders))
{
"@id" : if ( ."@id" ) sha256-hex($salt + ."@id"),
"actor" : {
// remove one user identifier, but spt:userId also contains user ID
"@id" : if ( .actor."@id" ) null,
"spt:remoteAddress" : if (not($isFiltered)) .actor."spt:remoteAddress",
"spt:remoteAddressV6" : if (not($isFiltered)) .actor."spt:remoteAddressV6",
* : .
},
"device" : {
"environmentId" : if ( .device.environmentId ) null,
* : .
},
"location" : if (not($isFiltered)) .location,
* : .
}
33. Sinks
VG-ArticleViews-1:
eventType: PulseLoginPreserved
filter: get-client(.) == "vg" and ."@type" == "View" and contains(.object."@type", ["Article", "SalesPoster"])
transform: transforms/vg-article-views.jslt
kinesis:
arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_article_views
role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
VG-FrontExperimentsEngagement-1:
eventType: PulseAnonymized
filter: get-client(.) == "vg" and ."@type" == "Engagement" and contains(.object."@type", ["Article", "SalesPoster"])
and (contains("df-86-", .origin.terms) or contains("df-86-", .object."spt:custom".terms))
transform: transforms/vg-article-views.jslt
kinesis:
arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement
role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
34. routing-lib
• A Scala library that can load the YAML files
• main dependencies: Jackson and JSLT
• One main API method:
• RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)]
• Used by
• The Batch Job 2.0
• Yggdrasil
• Duratro
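Conceptually, route() applies each event type's filter and transform to the incoming event and pairs the result with every matching sink. A Python sketch of that behavior, with hypothetical dict-based config standing in for the YAML/JSLT machinery:

```python
def route(event, event_types, sinks):
    """Return (transformed event, sink name) pairs, mimicking the shape of
    RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)]."""
    results = []
    for sink in sinks:
        etype = event_types[sink["eventType"]]
        if not etype["filter"](event):
            continue                       # event type's filter rejects the event
        shaped = etype["transform"](event)
        if sink["filter"](shaped):         # sink-level filter from the routing config
            results.append((sink["transform"](shaped), sink["name"]))
    return results
```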
35. The Batch Job 2.0
• Three simple steps
• read JSON input from S3 (Spark solves this)
• push JSON data through routing-lib
• write JSON back to S3 (Spark solves this)
• There’s a little more to it than that, but that’s the heart of
it
• much better performance (much less data shuffling)
• better performance means it handles “cluster weather” more
robustly
• easier to catch up if we fall behind
36. Static routing
• Routing was configuration, packaged as a jar
• Every change required
• make routing PR, merge
• wait for Travis build to upload to Artifactory
• upgrade the batch job, deploy
• upgrade Yggdrasil, deploy
• upgrade Duratro, deploy
39. Self-serve
• Finding the right repo and learning a YAML syntax
is non-trivial
• What if users could instead use a user interface?
• select an event type
• pick a transform
• add a filter, if necessary
• then configure a sink
• press the button, and wham!
40. YAML format
• Was designed for this right from the start
• Having event-types.yaml separate
• enables reuse across batch and streaming
• but also in selfserve
• Making a flat format based on references
• avoids deep, nested tree structures in syntax
• means config can be merged from many sources
46. Schibsted’s setup
• The individual sites are legally data controllers
• that means, they own the data and the responsibility
• Central Schibsted components are data processors
• that means, they do only what the controllers tell them to
• upside: responsibility rests with the controllers
• Has lots of consequences for how things work
47. Main issues
• Anonymization: handled with transforms
• Retention: handled with S3 lifecycle policies
• Takeout
• only necessary where we are the primary storage
• Deletion: a bit of a problem
• but we have 30 days to comply
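Retention via S3 lifecycle policies means the bucket itself expires old objects, with no batch job involved. A hedged example of what such a rule could look like; the prefix and the 90-day period are invented, since actual retention periods are set per dataset:

```json
{
  "Rules": [
    {
      "ID": "expire-raw-events",
      "Filter": { "Prefix": "raw/pulse/" },
      "Status": "Enabled",
      "Expiration": { "Days": 90 }
    }
  ]
}
```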
53. Data takeout
• Privacy broker posts message on SQS queue
• we take it down to S3, to get Luigi integration
• Luigi starts Spark job
• reads through stored data, looking for that user
• all events from that user are written to S3
• post SQS message back with reference to event and file
• Source data stored in Parquet
• use manually generated index files to avoid processing data
that has no events from this user
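The index-file trick boils down to: only scan partitions whose index mentions the user. A toy Python version, assuming the index is a mapping from partition to the set of user ids seen there (the real index files are manually generated alongside the Parquet data):

```python
def partitions_to_scan(index, user_id, all_partitions):
    """Skip partitions whose index shows no events for this user,
    so the Spark job reads only the data that can match."""
    return [p for p in all_partitions if user_id in index.get(p, set())]
```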
54. Data deletion
• Data is stored as Parquet files in S3
• but Parquet doesn’t have a “delete” function
• you have to rewrite the dataset
• This is slow and costly
• but can be batched: delete many users at once
• with so many users batched together, the index no longer helps
• What if someone is reading the dataset when you
are rewriting it?
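Batched deletion amortizes the rewrite: one pass over the dataset drops the rows of many users at once, instead of one costly rewrite per user. A toy sketch; the row shape is hypothetical:

```python
def rewrite_without(rows, user_ids):
    """Rewrite the dataset once, dropping rows for a whole batch of users.
    In reality this is a Spark job rewriting Parquet files in S3."""
    return [row for row in rows if row["user_id"] not in user_ids]
```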
55. Solution
• bucket/prefix/year=x/month=y/…/gen=0
• data stored under here initially
• bucket/prefix/year=x/month=y/…/gen=1
• data stored here after first rewrite
• once _SUCCESS flag is there consumers must switch
• after a day or so, gen=0 can be deleted
• Janitor Monkey deletes orphan generations & handles
retention
• because deletion rewrites data, it messes up the last-modified timestamps
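Choosing which generation to read can be sketched as: pick the highest gen=N under the partition that has a _SUCCESS marker. A minimal Python version over a list of S3 keys:

```python
import re

def latest_complete_generation(keys):
    """Given the S3 keys under one partition, return the highest gen=N
    that has a _SUCCESS flag; consumers should read only that prefix."""
    complete = set()
    for key in keys:
        m = re.search(r"gen=(\d+)/_SUCCESS$", key)
        if m:
            complete.add(int(m.group(1)))
    return max(complete) if complete else None
```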
56. Data Access Layer
• Building logic to handle gen=x is a pain for users
• the Data Access Layer wraps Spark to do it for them
• Can in the future be expanded to
• filter out rows from users that opt out of certain processing
• do access control on a column level
• …
58. Stricter access to data
• Because the sites are data controllers, they must
decide who should have access to what
• access is controlled by IAM policies
• but users can’t write those, and that’s not safe, anyway
• The system essentially requires communication
• data consumers must request data
• data owners (sites) must approve/reject requests
59. The granule of access
• We have many datasets
• Pulse (anonymized, identified, …)
• Payments (payment data)
• Content (ad and article content)
• …
• Access is per (dataset, site) combination
• you can have access to VG Pulse, but not Aftenposten
Pulse
63. Challenges
• Users can potentially get access to many (dataset,
site) combinations
• each one needs to go into their IAM policies
• IAM has very strict API limits
• user inline policies: total max 2048 bytes
• managed policy size: max 6144 bytes
• max managed policies per account: 1500
• max attached managed policies: 10
• max group memberships: 10
64. Permission packing
• First pack as much as possible into an inline policy
• Then fill up personal managed policies & attach
• Then create more policies and attach to groups,
then attach those
• We believe we can attach 10,000 datasets this way
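The packing order above can be sketched greedily: fill the inline policy first, then spill into managed policies, bounded by the IAM limits from the previous slide. A simplified Python sketch; group spill-over is left out, and each statement is passed as its serialized JSON string:

```python
def pack(statements, inline_limit=2048, managed_limit=6144, max_attached=10):
    """Greedy packing: inline policy up to 2048 bytes, then managed
    policies of up to 6144 bytes each, at most 10 attached."""
    inline, managed, used = [], [], 0
    for s in statements:
        if used + len(s) <= inline_limit:
            inline.append(s)              # still room in the inline policy
            used += len(s)
            continue
        if not managed or sum(len(x) for x in managed[-1]) + len(s) > managed_limit:
            managed.append([])            # start a new managed policy
        managed[-1].append(s)
    if len(managed) > max_attached:
        raise ValueError("exceeds attached-policy limit; spill over to groups")
    return inline, managed
```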
65. Sync
12:18:56 INFO c.s.s.d.s.model.IAMPolicyGenerator - User lars.marius.garshol@schibsted.com exists, must be cleaned
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Deleting inline selfserve-policy from lars.marius.garshol@schibsted.com
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-2 from lars.marius.garshol@schibsted.com
12:18:58 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-1 from lars.marius.garshol@schibsted.com
12:18:59 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on lars.marius.garshol@schibsted.com
12:19:00 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-1 exists, deleting
12:19:00 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-1 to lars.marius.garshol@schibsted.com, 13 statements left
12:19:01 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-2 exists, deleting
12:19:01 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-2 to lars.marius.garshol@schibsted.com, 0 statements left
12:19:02 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on role lars.marius.garshol@schibsted.com
67. Slow maturation
• We started out with almost nothing in 2015
• now finally becoming something closer to what we need to be
• New challenges ahead
• management wants to scale up usage of common solutions
dramatically
• legal basis management is coming
• selfserve needs more functionality
• Data Quality Tooling needs an overhaul
• data discovery service likewise
• …