This document discusses using Kubernetes as a data platform. It describes use-case-driven development of the initial platform, focusing on simple use cases that provide value; onboarding new data sources; an overview of the data platform architecture, including the data lake and batch/online services; deployment approaches, both on-premise and cloud native; and challenges such as GDPR compliance and autoscaling. Lessons learned include selecting cloud infrastructure based on where the data lives, and choosing Kubernetes for its platform-team support and to avoid maintaining separate clusters.
7. Cloud Selection
The Pragmatic Choice
➔ Known to people in the dev teams
➔ New base platform for all other applications within Bonnier News
8. Use Case Driven Development
➔ Use cases drive the development of the platform
➔ Focus on value and quality, not on ingesting all of the company's data
➔ Start with simple use cases!
9. Use Case Driven Development
[Cycle diagram: find a use case that provides value → bring new data into the platform → evolve the platform based on requirements → repeat]
10. Data-centric innovation
● Need data from teams
  ○ willing?
  ○ backlog?
  ○ collected?
  ○ useful?
  ○ extraction?
  ○ data governance?
  ○ history?
15. Data platform overview
[Diagram: offline data platform with a data lake (cold store holding datasets), batch processing (jobs), and services feeding the online services]
16. Data platform overview
[Diagram: as above, with jobs grouped into pipelines under workflow orchestration]
17. Data platform overview
[Diagram: as above, adding data features served from the data lake to internal services as well as online services]
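To make the Dataset / Job / Pipeline vocabulary above concrete, here is a minimal sketch of how such building blocks can look as Luigi tasks (the workflow orchestrator used in this deck). All task names and paths are illustrative assumptions, not the production code.

    import luigi

    class RawEvents(luigi.ExternalTask):
        """Hypothetical dataset landed in the data lake by an ingest job."""
        date = luigi.DateParameter()

        def output(self):
            # A dataset is just a date-partitioned file in cold storage.
            return luigi.LocalTarget(f"/lake/raw_events/{self.date}/data.jsonl")

    class DailyAggregate(luigi.Task):
        """Hypothetical batch job producing a derived dataset for services."""
        date = luigi.DateParameter()

        def requires(self):
            return RawEvents(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"/lake/daily_aggregate/{self.date}/data.jsonl")

        def run(self):
            with self.input().open("r") as src, self.output().open("w") as dst:
                for line in src:
                    dst.write(line)  # a real job would aggregate here

A pipeline is then just a set of such jobs wired together by requires(), and the data lake is the collection of their output targets.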
18. Life of a change, batch pipelines
● My pipeline, version 2!
  ○ Dual datasets during the transition (see the sketch below)
● Run downstream pipelines in parallel
  ○ Cheap
  ○ Low risk
  ○ Easy rollback
● Easy to test end-to-end
  ○ Upstream team can do the change
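A minimal sketch of the dual-datasets idea, assuming Luigi and a hypothetical version parameter: bumping the version writes to a parallel path, so the v1 and v2 datasets coexist while downstream pipelines run against both.

    import luigi

    class MyPipeline(luigi.Task):
        date = luigi.DateParameter()
        version = luigi.IntParameter(default=2)  # bump for "version 2!"

        def output(self):
            # v1 and v2 outputs live side by side during the transition,
            # which makes parallel downstream runs and rollback cheap.
            return luigi.LocalTarget(
                f"/lake/my_pipeline/v{self.version}/{self.date}/data.jsonl")

        def run(self):
            with self.output().open("w") as out:
                out.write("...")  # new pipeline logic goes here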
19. Egress target change
● Need output in a different storage system!
  ○ Adding an egress target is easy (see the sketch below)
  ○ Backfilling a new egress target is easy
● Helps limit cost
  ○ Partially aggregate → BigQuery / Redshift
  ○ Limited retention in egress storage
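Adding an egress target can be sketched as one more downstream Luigi task; the loader function and paths here are assumptions. Because Luigi runs whatever output targets are missing, pointing the new task at historical dates backfills the egress target automatically.

    import luigi

    class EgressToWarehouse(luigi.Task):
        """Hypothetical egress task copying a dataset to warehouse storage
        (e.g. BigQuery / Redshift), which can have its own limited retention."""
        date = luigi.DateParameter()

        def requires(self):
            return DailyAggregate(date=self.date)  # task from the earlier sketch

        def output(self):
            return luigi.LocalTarget(f"/egress/warehouse/{self.date}/_SUCCESS")

        def run(self):
            load_to_warehouse(self.input().path)  # hypothetical loader
            with self.output().open("w") as marker:
                marker.write("done")  # success marker makes the task complete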
20. Life of an error, batch pipelines
● My dataset, bad version!
  1. Revert serving datasets to the old version
  2. Fix the bug
  3. Remove the faulty datasets
  4. Backfill is automatic (Luigi); see the sketch below
  Done!
● Low cost of error
  ○ Reactive QA
  ○ Production environment is sufficient
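A sketch of why the backfill is automatic, assuming the DailyAggregate task from the earlier sketches: Luigi treats a task as complete exactly when its output target exists, so removing the faulty datasets and re-running the normal schedule recomputes only the missing partitions.

    import datetime
    import luigi

    bad_dates = [datetime.date(2018, 5, d) for d in (1, 2, 3)]  # assumed range

    # Step 3: remove the faulty datasets.
    for date in bad_dates:
        target = DailyAggregate(date=date).output()
        if target.exists():
            target.remove()

    # Step 4: the next scheduled run sees missing outputs and backfills them.
    luigi.build([DailyAggregate(date=d) for d in bad_dates],
                local_scheduler=True)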
21. Deployment example, on-premise
[Diagram: source repo → my-pipe-7.tar.gz (Luigi DSL, jars, config) → Luigi daemon and workers]
● Standard deployment artifact, standard artifact store
● All that a pipeline needs, installed atomically:
  > pip install my-pipe-7.tar.gz
● Redundant cron schedule, higher frequency:
  10 * * * * luigi --module mymodule MyDaily
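For illustration, the MyDaily task named in the cron line might look like the sketch below; the module contents and paths are assumed, not taken from the deck. Because a Luigi task is complete once its output exists, the redundant, higher-frequency cron schedule is safe: extra invocations find nothing to do and exit.

    # mymodule.py -- illustrative content of my-pipe-7.tar.gz
    import datetime
    import luigi

    class MyDaily(luigi.Task):
        date = luigi.DateParameter(default=datetime.date.today())

        def output(self):
            return luigi.LocalTarget(f"/lake/my_daily/{self.date}/data.jsonl")

        def run(self):
            with self.output().open("w") as out:
                out.write("...")  # daily pipeline logic goes here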
28. GDPR
Article 17:
“The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where one of the following grounds applies:”
➔ the personal data are no longer necessary in relation to the purposes for which they were collected or otherwise processed - Data Retention
➔ the data subject withdraws consent on which the processing is based - Data Deletion Requests
30. GDPR - Retention
[Diagram: record { id: …, pii: [...] } → create a key for the id → encrypt the personal data with the key]
➔ Each dataset has a retention time set by the owners of the data
➔ Create new keys every 30 days
➔ Destroy keys older than the retention time (see the sketch below)
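A minimal crypto-shredding sketch of this retention scheme, assuming an in-memory key store and the Fernet cipher from the cryptography package; a real deployment would keep the keys in a proper key management service.

    import datetime
    from cryptography.fernet import Fernet

    key_store = {}  # (user_id, period) -> (created, key); assumed layout

    def current_period(today=None):
        today = today or datetime.date.today()
        return today.toordinal() // 30  # a new key roughly every 30 days

    def encrypt_pii(record):
        key_id = (record["id"], current_period())
        if key_id not in key_store:  # create a key for the id
            key_store[key_id] = (datetime.date.today(), Fernet.generate_key())
        _, key = key_store[key_id]
        f = Fernet(key)
        # Encrypt the personal data with the key; the rest stays in plaintext.
        record["pii"] = [f.encrypt(v.encode()).decode() for v in record["pii"]]
        return record

    def enforce_retention(retention_days):
        # Destroying a key renders every record encrypted with it unreadable.
        today = datetime.date.today()
        for key_id, (created, _) in list(key_store.items()):
            if (today - created).days > retention_days:
                del key_store[key_id]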
31. GDPR - Right to be forgotten
[Flow: list of users that have requested deletion → find the keys for those users → destroy the keys]
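Continuing the sketch above, honoring deletion requests then amounts to destroying the affected users' keys:

    def forget_users(deletion_requests):
        """Destroy all keys for users who requested erasure; their encrypted
        personal data becomes unreadable everywhere it is stored."""
        for user_id, period in list(key_store):
            if user_id in deletion_requests:
                del key_store[(user_id, period)]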
32. Use Cases in Use
➔ Machine Learning
  ◆ Built a system that predicts whether a visitor will watch an in-video ad
➔ Creating Reports
  ◆ Daily reporting data for the ad team
  ◆ Weekly report of ad-viewing data for the site team
➔ GDPR Registry Extract
  ◆ Collect data from multiple different sources
  ◆ Merge the data
  ◆ Send the merged data to the requesting user
33. Lessons Learned
Cloud selection is influenced by data location
Most data for the use cases we started with was in Google Cloud Storage / BigQuery, incurring extra development time and cost to exfiltrate that data.
Why Kubernetes?
Same platform as other teams + great support from the infrastructure platform team.
No Spark cluster maintenance, tweaking, or debugging.
Autoscaling works, but with some challenges for batch jobs.
34. Summary
Use case driven development == short time to production
First pipeline in 3 weeks
Small team: 2-4 people
Keep it simple
10-15 pipelines