Data collection in AWS
Lars Marius Garshol, lars.marius.garshol@schibsted.com
http://twitter.com/larsga
2018–09–04, AWS Meetup
Schibsted?
Collecting data?
What?
Schibsted
3
30 countries
200 million users/month
20 billion pageviews/month
Three parts
4
Data Platform team
5
Jordi Roura
Ole-Magnus Røysted Aker
Sangram Bal
Oleksandr Ivanov
Mårten Rånge
Håvard Wall
Bjørn Rustad
Rolv Seehus
Håkon Åmdal
Fredrik Vraalsen
Øyvind Løkling
Per Wessel Nore
Lars Marius Garshol
Ning Zhou
Data Platform
6
Data Platform
Batch
Streaming
Pulse
Data volume
7
Original architecture
8
Collector
Kinesis
Kinesis
Storage S3
The
Batch
Job
Piper
S3
The Batch Job
• Implemented in Apache Spark
• Luigi for scheduling/orchestration
• Runs in a shared Mesos cluster
• this was set up because letting users create individual
clusters became far too expensive
• this cluster is the main cost driver for Schibsted’s AWS bills
• Difficult environment to work with
• hard to debug and develop in
9
Problems with batch
• Configuration file mapped to Spark tasks
• very complex set of Spark tasks
• requires lots of communication between Spark nodes
• runs slowly
• Very resource-intensive
• had difficulty keeping up with incoming traffic
• very sensitive to “cluster weather”
• brittle
10
Piper & Storage
• Ordinary Java and Scala applications
• read from the data source, perform all processing on one
node, then write to the destination
• no communication necessary between nodes
• normal EC2 nodes with the application baked into the AMI
• therefore scales trivially with Auto Scaling Groups
• Instrumented with Datadog, logs loaded into
SumoLogic
11
OK
Piper’s problem
12
Storage S3
Piper
SQS
OK
OK
OK
OK
Slow
Kafka vs Kinesis
• Kinesis has very strict API limits
• total read limit = 2x write limit
• effectively limited to 2 readers
• Kinesis API is very limited
• basically only supports reading records in order
• Kafka improves on both
• can support many readers simultaneously
• advanced Scala DSL with window functions, etc.
13
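The "effectively limited to 2 readers" point follows directly from the per-shard quotas. A back-of-the-envelope sketch, using the standard Kinesis shard quotas (1 MB/s in, 2 MB/s out, without enhanced fan-out):

```python
# Why classic Kinesis effectively caps you at ~2 full-speed consumers.
# Standard per-shard quotas (no enhanced fan-out): 1 MB/s in, 2 MB/s out.
WRITE_MBPS_PER_SHARD = 1.0
READ_MBPS_PER_SHARD = 2.0

def max_full_speed_readers(ingest_mbps: float, shards: int) -> int:
    """How many consumers can each read the full stream in real time?"""
    total_read_capacity = shards * READ_MBPS_PER_SHARD
    # Every consumer that wants to keep up must read the whole ingest rate.
    return int(total_read_capacity // ingest_mbps)

# A stream writing at full capacity on every shard:
shards = 10
ingest = shards * WRITE_MBPS_PER_SHARD  # 10 MB/s
print(max_full_speed_readers(ingest, shards))  # 2
```

Adding shards does not help: write capacity grows in step with read capacity, so the 2:1 ratio (and thus the two-reader ceiling) stays the same.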
New architecture
14
Collector
Kinesis
Kinesis
S3
Storage
S3
The
Batch
Job
S3
Kafka
Storage Kafka ?
Handling slow consumers
15
Kafka Yggdrasil
Firehose
One topic per consumer
Duratro
One topic per consumer
OK
OK
OK
OK
Slow
Transforms
Filtering
Pulse challenges
• Pulse is a tracking solution with no user interface
• you want dashboards to analyze user traffic? sorry
• problem is: not enough resources to develop that
• Using Amplitude to solve that
• created a Duratro sink for Amplitude
• simple HTTP POST of JSON events to Amplitude API
• users can now create Amplitude projects, feed their Pulse
events there, and finally have dashboards
16
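A sink like this amounts to wrapping a batch of events in the JSON envelope Amplitude's HTTP API expects and POSTing it. A hypothetical sketch, not Schibsted's code; the endpoint URL and envelope fields are assumptions about Amplitude's batch API:

```python
import json

# Hypothetical Duratro-style Amplitude sink: build the JSON body for one
# HTTP POST of a batch of events. Endpoint and field names are illustrative.
AMPLITUDE_URL = "https://api.amplitude.com/2/httpapi"  # assumed endpoint

def build_payload(api_key, events):
    """Wrap a batch of already-transformed events in the API envelope."""
    return json.dumps({"api_key": api_key, "events": events})

# The real sink would then send the body, e.g. with
#   requests.post(AMPLITUDE_URL, data=body,
#                 headers={"Content-Type": "application/json"})
body = build_payload("secret-key", [{"event_type": "View Article"}])
print(json.loads(body)["events"][0]["event_type"])  # View Article
```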
Transforms
• Because of GDPR we need to anonymize most incoming data formats
• Some data has quality issues that cannot be fixed at the source, so transforms are needed to fix them
• In many cases data needs to be transformed from one
format to another
• Pulse to Amplitude
• ClickMeter to Pulse
• Convert data to match database structures
• …
17
Who configures?
• Schibsted has >100 business units
• for Data Platform to do detailed configuration for all of
these isn’t going to scale
• for sites to do it themselves saves lots of time
• Transforms require domain knowledge
• each site has its own specialities in Pulse tracking
• to transform these correctly requires knowing all this
18
Batch config: 1 sink
{
"driver": "anyoffilter",
"name": "image-classification",
"rules": [
{ "name": "ImageClassification", "key": "provider.component", "value": "ImageClassification" },
{ "name": "ImageSimilarity", "key": "provider.component", "value": "ImageSimilarity" }
],
"onmatch": [
{
"driver": "cache",
"name": "image-classification",
"level": "memory+disk"
},
{
"driver": "demuxer",
"name": "image-classification",
"rules": "${pulseSdrnFilterUri}",
"parallel": true,
"onmatch": {
"driver": "textfilewriter",
"uri": "${imageSiteUri}",
"numFiles": {
"eventsPerFile": 500000,
"max": ${numExecutors}
}
}
}
],
19
Early config was 1838 lines
Yggdrasil: Scala DSL
override def buildTopology(builder: StreamsBuilder): Unit = {
import com.schibsted.spt.data.yggdrasil.serde.YggdrasilImplicitSerdes.{json, strings}
// mads events routing
val madsProEvents = madsEvents.tryFilter(MadsProEventsPredicate, deadLetterQueue("Default"))
val madsPreEvents = madsEvents.tryFilter(MadsPreEventsPredicate, deadLetterQueue("Default"))
val madsDevEvents = madsEvents.tryFilter(new EventSampler(0.01), deadLetterQueue("Default"))
madsProEvents ~> contentTopic("Personalisation-Rocket-Pro")
madsPreEvents ~> contentTopic("Personalisation-Rocket-Pre")
madsDevEvents ~> contentTopic("Personalisation-Rocket-Dev")
madsProEvents ~> providerIdDemuxer(
"^urn:schibsted:madstorage-(rkt|web)-tayara-prod:mp-ads-delivery".r -> contentTopic("Image-
"^urn:schibsted:madstorage-(rkt|web)-corotos-prod:mp-ads-delivery".r -> contentTopic("Image-
)
20
Duratro: config
pipes {
ATEDev {
sourceTopic = "Public-DataPro-Yggdrasil-ATE-Dev-AteBehaviorEvent-1"
sink {
type = "kinesis"
stream = "AUTO-ate-online-events-loader-AteOnlineEventDataStream-3WEL7DDN2KQG"
region = "eu-west-1"
role = "arn:aws:iam::972724508451:role/AUTO-ate-online-events-lo-AteOnlineEventsDataWrite-GTMDBZSEZJF0"
session = "kinesis-ate-dev"
}
}
21
Not in great shape
• Transforms were written in Scala code
• not easy to read even for Scala developers
• most site devs are not Scala developers
• Config changes require deploys
• in streaming, matching changes must be made to both
Yggdrasil and Duratro
• Three different configuration syntaxes
• the same type of event is defined differently in batch & streaming
22
What if?
• We had an expression language for JSON, kind of
like jq
• could write routing filters using that
• We had a transformation language for JSON
• write as JSON template, using expression language to
compute values to insert
• A custom routing language for both batch and
streaming, based on this language
• designed for expressiveness and easy deploys
23
JSLT
• Custom language for JSON transforms & queries
• First iteration
• JSON syntax with ${ … } wrappers for jq expressions
• very simple additions: let, for and if expressions
• tried out, worked well, but not ideal
• Second iteration
• own language from the ground up
• far better performance
• easier to write and use
24
JSLT expressions
25
.foo                         Get "foo" key from input object
.foo.bar                     As above + .bar inside that
.foo == 231                  Comparison
.foo and .bar < 12           Boolean operator
test(.foo, "^[a-z0-9]+$")    Functions (& regexps)
JSLT transforms
{
"insert_id" : ."@id",
"event_type" : ."@type" + " " + .object."@type",
"device_id" : .device.environmentId,
"time": amp:parse_timestamp(.published),
"device_manufacturer": .device.manufacturer,
"device_model": .device.model,
"language": .device.acceptLanguage,
"os_name": .device.osType,
"os_version": .device.osVersion,
…
}
26
More features
• [for (.array) number(.) * 1.1]
• convert each element in an array
• * : .
• object matcher, keeps rest of object unchanged
• {for (.object) “prefix” + .key : .value}
• dynamic rewrite of object
• def func(p1, p2)
• define custom functions
27
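For readers who think in comprehensions, the constructs above have rough Python analogues. This is an illustrative sketch only; JSLT's actual semantics are defined by the language itself:

```python
# Rough Python analogues of the JSLT constructs above (illustrative only).
array = [1, 2, 3]
obj = {"a": 1, "b": 2}

# [for (.array) number(.) * 1.1]  -- convert each element in an array
converted = [float(x) * 1.1 for x in array]

# {for (.object) "prefix" + .key : .value}  -- dynamic rewrite of an object
rewritten = {"prefix" + k: v for k, v in obj.items()}

# "* : ." (object matcher) -- keep the rest of the object unchanged,
# overriding only the keys you name explicitly
overrides = {"a": None}
matched = {**obj, **overrides}

print(rewritten)  # {'prefixa': 1, 'prefixb': 2}
```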
Benefits of JSLT
• Easier to read and write than code
• Doesn’t require user to know Scala
• Can be embedded in configuration
• Flexible enough to support 99-100% of filters/
transforms
• Performance quite good
• 5-10x faster than the original jackson-jq-based version
28
Routing language
Firehose:
description: All incoming events.
transform: transforms/base-cleanup.jslt
PulseBase:
description: Cleaned-up Pulse events with all the information in them.
baseType: Firehose
filter: import "filters/pulse.jslt" as pulse pulse(.)
transform: transforms/pulse-cleanup.jslt
postFilter: import "filters/pulse-valid.jslt" as valid valid(.)
30
Pulse definitions
PulseIdentified:
description: Pulse events with personally identifying information included.
baseType: PulseBase
filter: .actor."spt:userId"
transform: transforms/pulse-identified.jslt
PulseAnonymized:
description: Pulse events with personally identifying information excluded.
baseType: PulseBase
transform: transforms/pulse-anonymized.jslt
31
pulse-identified.jslt
let isFiltered = (contains(get-client(.), $filteredProviders))
{
"@id" : if ( ."@id" ) sha256-hex($salt + ."@id"),
"actor" : {
// remove one user identifier, but spt:userId also contains user ID
"@id" : if ( .actor."@id" ) null,
"spt:remoteAddress" : if (not($isFiltered)) .actor."spt:remoteAddress",
"spt:remoteAddressV6" : if (not($isFiltered)) .actor."spt:remoteAddressV6",
* : .
},
"device" : {
"environmentId" : if ( .device.environmentId ) null,
* : .
},
"location" : if (not($isFiltered)) .location,
* : .
}
32
Sinks
VG-ArticleViews-1:
eventType: PulseLoginPreserved
filter: get-client(.) == "vg" and ."@type" == "View" and contains(.object."@type", ["Article", "SalesPoster"])
transform: transforms/vg-article-views.jslt
kinesis:
arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_article_views
role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
VG-FrontExperimentsEngagement-1:
eventType: PulseAnonymized
filter: get-client(.) == "vg" and ."@type" == "Engagement" and contains(.object."@type", ["Article", "SalesPoster"])
and (contains("df-86-", .origin.terms) or contains("df-86-", .object."spt:custom".terms))
transform: transforms/vg-article-views.jslt
kinesis:
arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement
role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
33
routing-lib
• A Scala library that can load the YAML files
• main dependencies: Jackson and JSLT
• One main API method:
• RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)]
• Used by
• The Batch Job 2.0
• Yggdrasil
• Duratro
34
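The shape of that one API method, `RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)]`, can be sketched in miniature. A hypothetical toy version, not the real library: each sink carries a filter predicate and a transform, and `route` returns every (transformed event, sink) pair that matched:

```python
# Hypothetical miniature of routing-lib's single API method. The Sink
# fields and example filter/transform are made up for illustration.
class Sink:
    def __init__(self, name, filter_fn, transform_fn):
        self.name = name
        self.filter = filter_fn
        self.transform = transform_fn

def route(sinks, event):
    """Return [(transformed_event, sink), ...] for all matching sinks."""
    return [(s.transform(event), s) for s in sinks if s.filter(event)]

vg_views = Sink(
    "VG-ArticleViews-1",
    filter_fn=lambda e: e.get("client") == "vg" and e.get("@type") == "View",
    transform_fn=lambda e: {"type": e["@type"], "client": e["client"]},
)
routed = route([vg_views], {"client": "vg", "@type": "View"})
print(routed[0][1].name)  # VG-ArticleViews-1
```

Because the method is a pure function from event to (event, sink) pairs, the same routing configuration can be driven by a Spark batch job or a streaming consumer without change.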
The Batch Job 2.0
• Three simple steps
• read JSON input from S3 (Spark solves this)
• push JSON data through routing-lib
• write JSON back to S3 (Spark solves this)
• There’s a little more to it than that, but that’s the heart of
it
• much better performance (much less data shuffling)
• better performance means it handles “cluster weather” more
robustly
• easier to catch up if we fall behind
35
Static routing
• Routing was configuration, packaged as a jar
• Every change required
• make routing PR, merge
• wait for Travis build to upload to Artifactory
• upgrade the batch job, deploy
• upgrade Yggdrasil, deploy
• upgrade Duratro, deploy
36
Hot deploy
37
routing
repo
Travis
build
SQS
queue
routing
config
publisher
S3
The Batch
Job
Yggdrasil Duratro
Self-serve
• Finding the right repo and learning a YAML syntax
is non-trivial
• What if users could instead use a user interface?
• select an event type
• pick a transform
• add a filter, if necessary
• then configure a sink
• press the button, and wham!
39
YAML format
• Was designed for this right from the start
• Having event-types.yaml separate
• enables reuse across batch and streaming
• but also in selfserve
• Making a flat format based on references
• avoids deep, nested tree structures in syntax
• means config can be merged from many sources
40
Hot deploy
42
routing
repo
Travis
build
SQS
queue
routing
config
publisher
S3
The Batch
Job
Yggdrasil Duratro
Pulse
Monitor
Dynamo-
DB
Lambda
Status
• Routing tree (207 sinks)
• streaming: 400 nodes (140 sinks)
• batch: 127 nodes (51 sinks)
• self-serve: ??? nodes (16 sinks)
• JSLT
• 51 transforms, 2366 lines
• runs ~10 billion transforms/day
• 28 contributors outside team
43
1 month of hot deploy
44
GDPR
Schibsted’s setup
• The individual sites are legally data controllers
• that means, they own the data and the responsibility
• Central Schibsted components are data processors
• that means, they do only what the controllers tell them to
• upside: responsibility rests with the controllers
• Has lots of consequences for how things work
46
Main issues
• Anonymization: handled with transforms
• Retention: handled with S3 lifecycle policies
• Takeout
• only necessary where we are the primary storage
• Deletion: a bit of a problem
• but we have 30 days to comply
47
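Retention via S3 lifecycle policies boils down to a small piece of bucket configuration. An illustrative rule, assuming a made-up prefix and retention period (not Schibsted's actual settings):

```json
{
  "Rules": [
    {
      "ID": "expire-raw-events",
      "Filter": { "Prefix": "pulse-raw/" },
      "Status": "Enabled",
      "Expiration": { "Days": 90 }
    }
  ]
}
```

Once attached to the bucket, S3 deletes matching objects automatically after the configured number of days, with no job to run or monitor.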
The big picture
48
Privacy
Broker
Data
Platform
D-day
49
Next 3 weeks
50
Deletion
51
Impact
52
Data takeout
• Privacy broker posts message on SQS queue
• we take it down to S3, to get Luigi integration
• Luigi starts Spark job
• reads through stored data, looking for that user
• all events from that user are written to S3
• post SQS message back with reference to event and file
• Source data stored in Parquet
• use manually generated index files to avoid processing data
that has no events from this user
53
Data deletion
• Data is stored as Parquet files in S3
• but Parquet doesn’t have a “delete” function
• you have to rewrite the dataset
• This is slow and costly
• but can be batched: delete many users at once
• batching so many users that the index is useless
• What if someone is reading the dataset when you
are rewriting it?
54
Solution
• bucket/prefix/year=x/month=y/…/gen=0
• data stored under here initially
• bucket/prefix/year=x/month=y/…/gen=1
• data stored here after first rewrite
• once the _SUCCESS flag is there, consumers must switch
• after a day or so, gen=0 can be deleted
• Janitor Monkey deletes orphan generations & handles
retention
• because deletion rewrites data, it messes up last-modified timestamps
55
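The reader-side rule for this scheme is simple: pick the highest generation that has a `_SUCCESS` flag, so a half-finished rewrite is never visible. A sketch under assumed key layout (the paths and helper name are illustrative):

```python
import re

# Pick the newest complete generation under one partition, per the
# gen=N scheme above. Keys and helper name are illustrative.
def latest_complete_generation(keys):
    """Return the 'gen=N' prefix of the newest generation with _SUCCESS."""
    complete = set()
    for key in keys:
        m = re.search(r"gen=(\d+)/_SUCCESS$", key)
        if m:
            complete.add(int(m.group(1)))
    if not complete:
        return None
    return "gen=%d" % max(complete)

keys = [
    "bucket/prefix/year=2018/month=9/gen=0/_SUCCESS",
    "bucket/prefix/year=2018/month=9/gen=0/part-0000.parquet",
    "bucket/prefix/year=2018/month=9/gen=1/part-0000.parquet",  # rewrite running
]
print(latest_complete_generation(keys))  # gen=0
```

Readers keep using gen=0 until gen=1's `_SUCCESS` flag appears, which is why gen=0 must survive for a day or so before Janitor Monkey removes it.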
Data Access Layer
• Building logic to handle gen=x is a pain for users
• the Data Access Layer wraps Spark to do it for them
• Can in the future be expanded to
• filter out rows from users that opt out of certain processing
• do access control on a column level
• …
56
Access control
57
AWS Databox
account
Sites
Components
Mesos cluster
Analysts
AWS IAM policies
Jupyter-aaS
Spark
SQLaaS
Stricter access to data
• Because the sites are data controllers, they must
decide who should have access to what
• access is controlled by IAM policies
• but users can’t write those, and that’s not safe, anyway
• The system essentially requires communication
• data consumers must request data
• data owners (sites) must approve/reject requests
58
The granule of access
• We have many datasets
• Pulse (anonymized, identified, …)
• Payments (payment data)
• Content (ad and article content)
• …
• Access is per (dataset, site) combination
• you can have access to VG Pulse, but not Aftenposten
Pulse
59
Dataset registry
60
Email notification
61
Review screen
62
Challenges
• Users can potentially get access to many (dataset,
site) combinations
• each one needs to go into their IAM policies
• IAM has very strict API limits
• user inline policies: total max 2048 bytes
• managed policy size: max 6144 bytes
• max managed policies per account: 1500
• max attached managed policies: 10
• max group memberships: 10
63
Permission packing
• First pack as much as possible into an inline policy
• Then fill up personal managed policies & attach
• Then create more policies and attach to groups,
then attach those
• We believe we can attach 10,000 datasets this way
64
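The packing order above can be sketched as a greedy bin-filling pass. A simplified illustration, not the production code: statements are plain strings, and the size limits are the IAM quotas from the previous slide (group-level spillover is left as an error):

```python
# Greedy sketch of permission packing: fill the inline policy first
# (2048-byte limit), then up to 10 attached managed policies (6144 bytes
# each). Statement rendering is simplified to plain strings.
INLINE_LIMIT = 2048
MANAGED_LIMIT = 6144
MAX_ATTACHED = 10

def pack(statements):
    """Distribute policy statements into (inline, [managed, ...]) buckets."""
    inline, managed = [], []
    bucket, used, limit = inline, 0, INLINE_LIMIT
    for stmt in statements:
        size = len(stmt)
        if used + size > limit:
            if len(managed) >= MAX_ATTACHED:
                raise ValueError("out of policy space; spill to group policies")
            bucket = []
            managed.append(bucket)
            used, limit = 0, MANAGED_LIMIT
        bucket.append(stmt)
        used += size
    return inline, managed

inline, managed = pack(["x" * 1000, "y" * 1000, "z" * 1000])
print(len(inline), len(managed))  # 2 1
```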
Sync
12:18:56 INFO c.s.s.d.s.model.IAMPolicyGenerator - User lars.marius.garshol@schibsted.com exists,
must be cleaned
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Deleting inline selfserve-policy from
lars.marius.garshol@schibsted.com
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/
selfserve-lars.marius.garshol@schibsted.com-2 from lars.marius.garshol@schibsted.com
12:18:58 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/
selfserve-lars.marius.garshol@schibsted.com-1 from lars.marius.garshol@schibsted.com
12:18:59 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on
lars.marius.garshol@schibsted.com
12:19:00 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-
lars.marius.garshol@schibsted.com-1 exists, deleting
12:19:00 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-
lars.marius.garshol@schibsted.com-1 to lars.marius.garshol@schibsted.com, 13 statements left
12:19:01 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-
lars.marius.garshol@schibsted.com-2 exists, deleting
12:19:01 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-
lars.marius.garshol@schibsted.com-2 to lars.marius.garshol@schibsted.com, 0 statements left
12:19:02 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on role
lars.marius.garshol@schibsted.com
65
Winding up
Slow maturation
• We started out with almost nothing in 2015
• now finally becoming something closer to what we need to be
• New challenges ahead
• management wants to scale up usage of common solutions
dramatically
• legal basis management is coming
• selfserve needs more functionality
• Data Quality Tooling needs an overhaul
• data discovery service likewise
• …
67
https://slideshare.net/larsga
Questions?

 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

Data collection in AWS at Schibsted

• can support many readers simultaneously
• advanced Scala DSL with window functions etc.
13
Handling slow consumers
Kafka
Yggdrasil
Transforms
Filtering
Firehose
One topic per consumer
Duratro
OK
OK
OK
OK
Slow
15
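The diagram above can be sketched in miniature: instead of every sink reading the same stream (where one slow reader holds the others back), the firehose is fanned out into one topic per consumer, so a slow consumer only delays its own topic. This is an illustrative Python sketch, not Schibsted code; the names `fan_out` and `routes` are made up.

```python
from collections import deque

def fan_out(events, routes):
    """routes: consumer name -> acceptance predicate.
    Returns one independent queue ("topic") per consumer."""
    topics = {name: deque() for name in routes}
    for event in events:
        for name, accept in routes.items():
            if accept(event):
                # a slow consumer only delays draining its own topic
                topics[name].append(event)
    return topics

events = [{"type": "View"}, {"type": "Click"}, {"type": "View"}]
routes = {
    "views-sink": lambda e: e["type"] == "View",
    "all-events-sink": lambda e: True,
}
topics = fan_out(events, routes)
```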
Pulse challenges
• Pulse is a tracking solution with no user interface
• you want dashboards to analyze user traffic? sorry
• problem is: not enough resources to develop that
• Using Amplitude to solve that
• created a Duratro sink for Amplitude
• simple HTTP POST of JSON events to Amplitude API
• users can now create Amplitude projects, feed their Pulse events there, and finally have dashboards
16
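The Amplitude sink boils down to mapping a Pulse event into Amplitude's event shape and POSTing it as JSON. A rough sketch of that idea follows; the field mapping is simplified from the JSLT example later in the deck, and the endpoint URL and helper names are assumptions, not the real sink.

```python
import json
import urllib.request

def to_amplitude(pulse_event):
    # simplified mapping; the real transform is the JSLT shown later
    return {
        "insert_id": pulse_event.get("@id"),
        "event_type": pulse_event.get("@type"),
        "device_id": pulse_event.get("device", {}).get("environmentId"),
    }

def post_events(api_key, events, url="https://api.amplitude.com/2/httpapi"):
    # simple HTTP POST of JSON events (not called in this sketch)
    body = json.dumps({"api_key": api_key, "events": events}).encode("utf-8")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)

event = {"@id": "abc", "@type": "View", "device": {"environmentId": "dev-1"}}
payload = to_amplitude(event)
```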
Transforms
• Because of GDPR we need to anonymize most incoming data formats
• Some data has quality issues that cannot be fixed at the source, requiring transforms to solve
• In many cases data needs to be transformed from one format to another
• Pulse to Amplitude
• ClickMeter to Pulse
• Convert data to match database structures
• …
17
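The anonymizing kind of transform can be illustrated in a few lines: pseudonymize the event ID with a salted SHA-256 and drop direct identifiers, mirroring what the pulse-identified.jslt later in the deck does. This Python helper is illustrative only; the real transforms are JSLT.

```python
import hashlib

def pseudonymize(event, salt):
    out = dict(event)
    if "@id" in out:
        # salted hash, same idea as sha256-hex($salt + ."@id") in JSLT
        out["@id"] = hashlib.sha256((salt + out["@id"]).encode("utf-8")).hexdigest()
    actor = dict(out.get("actor", {}))
    actor.pop("@id", None)  # remove one direct user identifier
    out["actor"] = actor
    return out

event = {"@id": "ev-1", "actor": {"@id": "user-1"}}
anon = pseudonymize(event, salt="s3cret")
```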
Who configures?
• Schibsted has >100 business units
• for Data Platform to do detailed configuration for all of these isn’t going to scale
• for sites to do it themselves saves lots of time
• Transforms require domain knowledge
• each site has its own specialities in Pulse tracking
• to transform these correctly requires knowing all this
18
Batch config: 1 sink
{
  "driver": "anyoffilter",
  "name": "image-classification",
  "rules": [
    {
      "name": "ImageClassification",
      "key": "provider.component",
      "value": "ImageClassification"
    },
    {
      "name": "ImageSimilarity",
      "key": "provider.component",
      "value": "ImageSimilarity"
    }
  ],
  "onmatch": [
    {
      "driver": "cache",
      "name": "image-classification",
      "level": "memory+disk"
    },
    {
      "driver": "demuxer",
      "name": "image-classification",
      "rules": "${pulseSdrnFilterUri}",
      "parallel": true,
      "onmatch": {
        "driver": "textfilewriter",
        "uri": "${imageSiteUri}",
        "numFiles": {
          "eventsPerFile": 500000,
          "max": ${numExecutors}
        }
      }
    }
  ],
Early config was 1838 lines
19
Yggdrasil: Scala DSL
override def buildTopology(builder: StreamsBuilder): Unit = {
  import com.schibsted.spt.data.yggdrasil.serde.YggdrasilImplicitSerdes.{json, strings}

  // mads events routing
  val madsProEvents = madsEvents.tryFilter(MadsProEventsPredicate, deadLetterQueue("Default"))
  val madsPreEvents = madsEvents.tryFilter(MadsPreEventsPredicate, deadLetterQueue("Default"))
  val madsDevEvents = madsEvents.tryFilter(new EventSampler(0.01), deadLetterQueue("Default"))

  madsProEvents ~> contentTopic("Personalisation-Rocket-Pro")
  madsPreEvents ~> contentTopic("Personalisation-Rocket-Pre")
  madsDevEvents ~> contentTopic("Personalisation-Rocket-Dev")

  madsProEvents ~> providerIdDemuxer(
    "^urn:schibsted:madstorage-(rkt|web)-tayara-prod:mp-ads-delivery".r -> contentTopic("Image-
    "^urn:schibsted:madstorage-(rkt|web)-corotos-prod:mp-ads-delivery".r -> contentTopic("Image-
  )
20
Duratro: config
pipes {
  ATEDev {
    sourceTopic = "Public-DataPro-Yggdrasil-ATE-Dev-AteBehaviorEvent-1"
    sink {
      type = "kinesis"
      stream = "AUTO-ate-online-events-loader-AteOnlineEventDataStream-3WEL7DDN2KQG"
      region = "eu-west-1"
      role = "arn:aws:iam::972724508451:role/AUTO-ate-online-events-lo-AteOnlineEventsDataWrite-GTMDBZSEZJF0"
      session = "kinesis-ate-dev"
    }
  }
21
Not in great shape
• Transforms were written in Scala code
• not easy to read, even for Scala developers
• most site devs are not Scala developers
• Config changes require deploys
• in streaming, matching changes must be made to both Yggdrasil and Duratro
• Three different configuration syntaxes
• the definition of the same type of event differs between batch & streaming
22
What if?
• We had an expression language for JSON, kind of like jq
• could write routing filters using that
• We had a transformation language for JSON
• write as a JSON template, using the expression language to compute values to insert
• A custom routing language for both batch and streaming, based on this language
• designed for easy expressivity & deploy
23
JSLT
• Custom language for JSON transforms & queries
• First iteration
• JSON syntax with ${ … } wrappers for jq expressions
• very simple additions: let, for and if expressions
• tried out, worked well, but not ideal
• Second iteration
• own language from the ground up
• far better performance
• easier to write and use
24
JSLT expressions
.foo                         Get "foo" key from input object
.foo.bar                     As above + .bar inside that
.foo == 231                  Comparison
.foo and .bar < 12           Boolean operator
test(.foo, "^[a-z0-9]+$")    Functions (& regexps)
25
JSLT transforms
{
  "insert_id" : ."@id",
  "event_type" : ."@type" + " " + .object."@type",
  "device_id" : .device.environmentId,
  "time": amp:parse_timestamp(.published),
  "device_manufacturer": .device.manufacturer,
  "device_model": .device.model,
  "language": .device.acceptLanguage,
  "os_name": .device.osType,
  "os_version": .device.osVersion,
  …
}
26
More features
• [for (.array) number(.) * 1.1]
• convert each element in an array
• * : .
• object matcher, keeps rest of object unchanged
• {for (.object) "prefix" + .key : .value}
• dynamic rewrite of object
• def func(p1, p2)
• define custom functions
27
Benefits of JSLT
• Easier to read and write than code
• Doesn’t require the user to know Scala
• Can be embedded in configuration
• Flexible enough to support 99-100% of filters/transforms
• Performance quite good
• 5-10x the original language based on jackson-jq
28
Routing language
Firehose:
  description: All incoming events.
  transform: transforms/base-cleanup.jslt

PulseBase:
  description: Cleaned-up Pulse events with all the information in them.
  baseType: Firehose
  filter: |
    import "filters/pulse.jslt" as pulse
    pulse(.)
  transform: transforms/pulse-cleanup.jslt
  postFilter: |
    import "filters/pulse-valid.jslt" as valid
    valid(.)
30
Pulse definitions
PulseIdentified:
  description: Pulse events with personally identifying information included.
  baseType: PulseBase
  filter: .actor."spt:userId"
  transform: transforms/pulse-identified.jslt

PulseAnonymized:
  description: Pulse events with personally identifying information excluded.
  baseType: PulseBase
  transform: transforms/pulse-anonymized.jslt
31
pulse-identified.jslt
let isFiltered = (contains(get-client(.), $filteredProviders))

{
  "@id" : if ( ."@id" ) sha256-hex($salt + ."@id"),
  "actor" : {
    // remove one user identifier, but spt:userId also contains user ID
    "@id" : if ( .actor."@id" ) null,
    "spt:remoteAddress" : if (not($isFiltered)) .actor."spt:remoteAddress",
    "spt:remoteAddressV6" : if (not($isFiltered)) .actor."spt:remoteAddressV6",
    * : .
  },
  "device" : {
    "environmentId" : if ( .device.environmentId ) null,
    * : .
  },
  "location" : if (not($isFiltered)) .location,
  * : .
}
32
Sinks
VG-ArticleViews-1:
  eventType: PulseLoginPreserved
  filter: |
    get-client(.) == "vg" and ."@type" == "View" and
    contains(.object."@type", ["Article", "SalesPoster"])
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_article_views
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write

VG-FrontExperimentsEngagement-1:
  eventType: PulseAnonymized
  filter: |
    get-client(.) == "vg" and ."@type" == "Engagement" and
    contains(.object."@type", ["Article", "SalesPoster"]) and
    (contains("df-86-", .origin.terms) or contains("df-86-", .object."spt:custom".terms))
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
33
routing-lib
• A Scala library that can load the YAML files
• main dependencies: Jackson and JSLT
• One main API method:
• RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)]
• Used by
• The Batch Job 2.0
• Yggdrasil
• Duratro
34
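The shape of that one API method can be sketched as follows: given a parsed event, apply each sink's filter, run its transform, and return (transformed event, sink) pairs. This is an illustrative Python stand-in for RoutingConfig.route, with a dict playing the role of the YAML config; none of the names are the real library's.

```python
def route(event, sinks):
    """Return (transformed event, sink name) pairs, one per matching sink."""
    matches = []
    for name, sink in sinks.items():
        if sink["filter"](event):
            matches.append((sink["transform"](event), name))
    return matches

sinks = {
    "VG-ArticleViews-1": {
        "filter": lambda e: e.get("client") == "vg" and e.get("@type") == "View",
        "transform": lambda e: {"type": e["@type"], "client": e["client"]},
    },
    "Firehose": {
        "filter": lambda e: True,       # all incoming events
        "transform": lambda e: e,
    },
}
routed = route({"client": "vg", "@type": "View"}, sinks)
```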
The Batch Job 2.0
• Three simple steps
• read JSON input from S3 (Spark solves this)
• push JSON data through routing-lib
• write JSON back to S3 (Spark solves this)
• There’s a little more to it than that, but that’s the heart of it
• much better performance (much less data shuffling)
• better performance means it handles “cluster weather” more robustly
• easier to catch up if we fall behind
35
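The three steps above can be sketched as a tiny local pipeline: read JSON lines (a list stands in for S3), push each event through a routing function, and collect the output per sink ready to write back. The trivial route() here is an illustrative stand-in for routing-lib.

```python
import json
from collections import defaultdict

def run_batch(lines, route):
    """Read JSON lines, route each event, group output by sink."""
    by_sink = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        for out_event, sink in route(event):
            by_sink[sink].append(out_event)
    return by_sink

def route(event):
    # trivial stand-in: everything goes to the firehose, views also to Views
    out = [(event, "Firehose")]
    if event.get("@type") == "View":
        out.append((event, "Views"))
    return out

lines = ['{"@type": "View"}', '{"@type": "Click"}']
output = run_batch(lines, route)
```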
Static routing
• Routing was configuration, packaged as a jar
• Every change required
• make a routing PR, merge
• wait for the Travis build to upload to Artifactory
• upgrade the batch job, deploy
• upgrade Yggdrasil, deploy
• upgrade Duratro, deploy
36
Self-serve
• Finding the right repo and learning a YAML syntax is non-trivial
• What if users could instead use a user interface?
• select an event type
• pick a transform
• add a filter, if necessary
• then configure a sink
• press the button, and wham!
39
YAML format
• Was designed for this right from the start
• Having event-types.yaml separate
• enables reuse across batch and streaming
• but also in selfserve
• Making a flat format based on references
• avoids deep, nested tree structures in the syntax
• means config can be merged from many sources
40
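Why the flat, reference-based format merges so easily can be shown in a few lines: each source contributes a mapping of names to definitions, and merging is just a dictionary union with duplicate detection, no tree-grafting required. Illustrative sketch only.

```python
def merge_configs(*sources):
    """Merge flat name -> definition mappings from many sources."""
    merged = {}
    for source in sources:
        for name, definition in source.items():
            if name in merged:
                raise ValueError(f"duplicate definition: {name}")
            merged[name] = definition
    return merged

# e.g. a central config plus a self-serve contribution
central = {"PulseBase": {"baseType": "Firehose"}}
selfserve = {"VG-ArticleViews-1": {"eventType": "PulseBase"}}
config = merge_configs(central, selfserve)
```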
Status
• Routing tree (207 sinks)
• streaming: 400 nodes (140 sinks)
• batch: 127 nodes (51 sinks)
• self-serve: ??? nodes (16 sinks)
• JSLT
• 51 transforms, 2366 lines
• runs ~10 billion transforms/day
• 28 contributors outside the team
43
1 month of hot deploy
44
GDPR
45
Schibsted’s setup
• The individual sites are legally data controllers
• that means they own the data and the responsibility
• Central Schibsted components are data processors
• that means they do only what the controllers tell them to
• upside: responsibility rests with the controllers
• Has lots of consequences for how things work
46
Main issues
• Anonymization: handled with transforms
• Retention: handled with S3 lifecycle policies
• Takeout
• only necessary where we are the primary storage
• Deletion: a bit of a problem
• but we have 30 days to comply
47
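Retention via lifecycle policies means no deletion job at all: objects under a prefix simply expire after N days. The sketch below builds a rule in the shape S3 lifecycle configuration expects; the bucket, prefix, and 90-day period are made-up examples, and the boto3 call is shown only in a comment.

```python
def retention_rule(prefix, days):
    """One S3 lifecycle rule: expire objects under `prefix` after `days` days."""
    return {
        "ID": f"retention-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }

policy = {"Rules": [retention_rule("pulse/", 90)]}
# applied with something like:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=policy)
```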
Data takeout
• Privacy broker posts a message on an SQS queue
• we take it down to S3, to get Luigi integration
• Luigi starts a Spark job
• reads through stored data, looking for that user
• all events from that user are written to S3
• posts an SQS message back with a reference to the event and file
• Source data stored in Parquet
• use manually generated index files to avoid processing data that has no events from this user
53
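The index-file trick in the last bullet can be sketched simply: keep, per partition, the set of user IDs present there, and only scan partitions whose index contains the requested user. The index layout below is an assumption for illustration.

```python
def partitions_to_scan(indexes, user_id):
    """indexes: partition path -> set of user ids present in that partition.
    Returns only the partitions worth reading for this user."""
    return [path for path, users in indexes.items() if user_id in users]

indexes = {
    "year=2018/month=07/": {"u1", "u2"},
    "year=2018/month=08/": {"u2", "u3"},
}
todo = partitions_to_scan(indexes, "u1")  # skips month=08 entirely
```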
Data deletion
• Data is stored as Parquet files in S3
• but Parquet doesn’t have a “delete” function
• you have to rewrite the dataset
• This is slow and costly
• but it can be batched: delete many users at once
• batching so many users that the index is useless
• What if someone is reading the dataset while you are rewriting it?
54
Solution
• bucket/prefix/year=x/month=y/…/gen=0
• data stored under here initially
• bucket/prefix/year=x/month=y/…/gen=1
• data stored here after the first rewrite
• once the _SUCCESS flag is there, consumers must switch
• after a day or so, gen=0 can be deleted
• Janitor Monkey deletes orphan generations & handles retention
• because deletion rewrites data: messes up last modified
55
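A reader's side of this scheme is easy to sketch: resolve the current generation by picking the highest gen directory that has a _SUCCESS flag, so a half-written rewrite is never visible. The file listing is faked here; the resolution logic is an illustrative guess at what such a reader does.

```python
def current_generation(listing):
    """listing: iterable of object keys under one partition prefix.
    Returns the highest gen number whose _SUCCESS flag exists."""
    complete = set()
    for key in listing:
        parts = key.split("/")
        if parts[-1] == "_SUCCESS" and parts[-2].startswith("gen="):
            complete.add(int(parts[-2].split("=")[1]))
    if not complete:
        raise FileNotFoundError("no complete generation")
    return max(complete)

listing = [
    "year=2018/gen=0/_SUCCESS",
    "year=2018/gen=0/part-0.parquet",
    "year=2018/gen=1/_SUCCESS",
    "year=2018/gen=1/part-0.parquet",
    "year=2018/gen=2/part-0.parquet",  # rewrite in progress, no _SUCCESS yet
]
gen = current_generation(listing)
```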
Data Access Layer
• Building logic to handle gen=x is a pain for users
• the Data Access Layer wraps Spark to do it for them
• Can in the future be expanded to
• filter out rows from users who opt out of certain processing
• do access control at the column level
• …
56
Access control
AWS Databox account
Sites
Components
Mesos cluster
Analysts
AWS IAM policies
Jupyter-aaS
Spark
SQLaaS
57
Stricter access to data
• Because the sites are data controllers, they must decide who should have access to what
• access is controlled by IAM policies
• but users can’t write those, and that wouldn’t be safe anyway
• The system essentially requires communication
• data consumers must request data
• data owners (sites) must approve/reject requests
58
The granule of access
• We have many datasets
• Pulse (anonymized, identified, …)
• Payments (payment data)
• Content (ad and article content)
• …
• Access is per (dataset, site) combination
• you can have access to VG Pulse, but not Aftenposten Pulse
59
Challenges
• Users can potentially get access to many (dataset, site) combinations
• each one needs to go into their IAM policies
• IAM has very strict API limits
• user inline policies: total max 2048 bytes
• managed policy size: max 6144 bytes
• max managed policies per account: 1500
• max attached managed policies: 10
• max group memberships: 10
63
Permission packing
• First pack as much as possible into an inline policy
• Then fill up personal managed policies & attach them
• Then create more policies, attach those to groups, then attach the groups
• We believe we can attach 10,000 datasets this way
64
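The packing order can be sketched as a greedy bin-packer over the IAM limits from the previous slide: statements go into the inline policy until its 2048-byte cap, then into personal managed policies of up to 6144 bytes each, at most 10 attached. This is a simplified illustration; byte budgeting here is just the statement's JSON length, and real IAM size counting differs slightly. The group-policy tier is left as an error case.

```python
import json

INLINE_MAX = 2048     # total inline policy bytes per user
MANAGED_MAX = 6144    # bytes per managed policy
MAX_ATTACHED = 10     # managed policies attachable per user

def pack(statements):
    """Greedily place statements: inline first, then managed policies."""
    inline, managed = [], []
    inline_used = 0
    for stmt in statements:
        size = len(json.dumps(stmt))
        if inline_used + size <= INLINE_MAX:
            inline.append(stmt)
            inline_used += size
        else:
            if not managed or managed[-1][1] + size > MANAGED_MAX:
                if len(managed) >= MAX_ATTACHED:
                    raise OverflowError("would need group-attached policies")
                managed.append(([], 0))
            stmts, used = managed[-1]
            managed[-1] = (stmts + [stmt], used + size)
    return inline, [stmts for stmts, _ in managed]

# one statement per (dataset, site) combination, names made up
stmts = [{"Sid": f"s{i}", "Resource": f"arn:aws:s3:::ds-{i}/*"} for i in range(100)]
inline, managed = pack(stmts)
```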
Sync
12:18:56 INFO c.s.s.d.s.model.IAMPolicyGenerator - User lars.marius.garshol@schibsted.com exists, must be cleaned
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Deleting inline selfserve-policy from lars.marius.garshol@schibsted.com
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-2 from lars.marius.garshol@schibsted.com
12:18:58 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-1 from lars.marius.garshol@schibsted.com
12:18:59 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on lars.marius.garshol@schibsted.com
12:19:00 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-1 exists, deleting
12:19:00 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-1 to lars.marius.garshol@schibsted.com, 13 statements left
12:19:01 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-2 exists, deleting
12:19:01 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-2 to lars.marius.garshol@schibsted.com, 0 statements left
12:19:02 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on role lars.marius.garshol@schibsted.com
65
Slow maturation
• We started out with almost nothing in 2015
• now finally becoming something closer to what we need to be
• New challenges ahead
• management wants to scale up usage of common solutions dramatically
• legal basis management is coming
• selfserve needs more functionality
• Data Quality Tooling needs an overhaul
• the data discovery service likewise
• …
67