The document provides an overview of the Scout24 Data Platform and its evolution towards becoming a truly data-driven company. Some key points:
- Scout24 operates various household brands across 18 countries with 80 million household reach.
- Historically, Scout24's technical architecture included a monolithic application and data warehouse that acted as a bottleneck.
- To address this, Scout24 built an internal "data platform" consisting of a microservices architecture, data lake, self-service analytics, and data ingestion tools to enable fast, easy product development supported by data and analytics.
- The data platform is thought of as a product in itself that provides generic layers upon which Scout24's products can be built.
1. www.scout24.com
The Scout24 Data Platform
A Technical Deep Dive
Munich | March 20, 2019 | Christian Dietze, Olalekan Elesin, Raffael Dzikowski
2. Scout24 AG
• 5 Core Geographies, with an overall presence in 18 countries
• 80m Household Reach
• 2 Major Household Brand Names
• MDAX-listed
• € 531.7 million revenue (2018)
3. Our technical evolution
[Diagram: from a monolith app with a production database feeding a central data warehouse, to many microservices, each with its own persistence]
9. We think of our Data Platform as a Product
• Just like AWS, Salesforce, etc. – the platform is a generic layer upon which Scout24’s products can be built
• BUT, we have a very, very small number of customers.
• That means product teams get personalized support and there is plenty of opportunity for collaboration.
10. We don’t dictate anything.
We just try to make certain things easier by offering a “paved path”
Product teams are fully empowered to make their own choices about
what is the best use of their resources.
11. “In almost all cases, we will not mandate that internal teams use these platforms and services—these platform teams will have to win over and satisfy their internal customers, sometimes even competing with external vendors.”
12. Guiding principle of the platform
• Autonomy for producers and consumers
Self-service Analytics
Self-service Data Ingestion
Self-service ETL
24. Watch the Video | Read the Blog Post on Medium
Necessary Cultural Changes during Migration
from Data Warehouse to Cloud Data Platform
Dr. Sean Gustafson, Scout24 AG
Data Festival 2018
25. 10,000 Foot Architecture Overview
[Diagram: the AWS Cloud hosts the Datalake, the Personal Analytics Cluster, Amazon Aurora, microservices and the OneScout Hive Metastore; the Data Center hosts the Core DB; data flows in via DataWario and Firehose / Kafka / Rest API; users include PMs, leaders, analysts, data scientists and engineers]
26. 10,000 Foot Architecture Overview
[Same diagram, annotated with the talk’s sections: Section #1 Self-Service Ingestion (Data Producer / Engineer), Section #2 Self-Service ETL (Engineer), Section #3 Self-Service Analytics (PM / Leader / Analyst)]
30. Data Lake Bucket Types
[Diagram: a Raw and a Processed zone, each with buckets at three visibility levels: Regular, Restricted and Personal]
31. Data Lake Access Roles
[Diagram: two Data Lake Accounts (ImmobilienScout24 and AutoScout24), each exposing a Regular DL Bucket Access Role plus per-bucket Restricted and Personal DL Bucket Access Roles]
32. Data Lake Ingestion Goals
Ingestion goals for microservices:
• Batch Support
• Streaming Support
• Rest API
• Scalability
36.–39. Firehose Ingestion Architecture
[Diagram, built up over four slides: a Data Producer in the Producer Account assumes the “Producer to Firehose” role via STS and sends data to Amazon Kinesis Data Firehose in the Data Lake Account; Firehose assumes the “Firehose to Datalake” role via STS and writes the data to the Datalake Bucket]
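As a minimal sketch of the producer side of this flow: STS-assume the cross-account role, then put a record to Firehose. The role ARN and delivery stream name below are placeholders, not Scout24's actual resources.

```python
import json


def encode_record(event: dict) -> dict:
    # Firehose expects the payload as bytes; newline-delimiting lets records
    # concatenate cleanly in the S3 objects Firehose writes.
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}


def send_event(event: dict,
               role_arn="arn:aws:iam::111111111111:role/producer-to-firehose",
               stream_name="datalake-ingest"):
    import boto3  # imported lazily so the pure parts work without boto3

    # Step 1: assume the cross-account "Producer to Firehose" role via STS.
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName="producer-ingest")["Credentials"]
    # Step 2: send the record with the temporary credentials; Firehose then
    # assumes the "Firehose to Datalake" role to write into the bucket.
    firehose = boto3.client(
        "firehose",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"])
    return firehose.put_record(DeliveryStreamName=stream_name,
                               Record=encode_record(event))
```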
48. Scout24 Infinity Cluster
[Diagram: a simple config deploys an Infinity Service on Amazon ECS]
• Central Logging to Elasticsearch
• Monitoring in Datadog
• Managed by Scout24 Cloud Platform Engineering
49. Get the Slide Deck
To Infinity and Beyond Handling
Heterogenous Container Clusters in AWS
Christine Trahe, Scout24 AG
AWS Summit Berlin 2019
50. Kafka Connect on Infinity
[Diagram: a simple Infinity config deploys Kafka Connect as an Infinity Service on Amazon ECS]
64. Hadoop Rest API – A Motivation
• Unified Set of Ingestion Operations
• Performs Event Validation
• Adds Reception Timestamp Field
• Stable User-Facing Endpoints, Varying Backend
65. Hadoop Rest API – Operation Overview
• PUT: Add Single Event
• PUT: Add Event Batch
• Automatic Partitioning by Ingestion Date
• Persisting to Disk Initially, Periodically Flushing to Target Store
• Also Publishing to Kinesis Stream
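What a producer's call against such an API could look like; the endpoint paths and payload shape are assumptions for illustration, since the deck only states that single events and batches are added via PUT.

```python
import json
import urllib.request


def build_put_request(base_url: str, stream: str,
                      events: list) -> urllib.request.Request:
    # Hypothetical paths: one endpoint for a single event, one for a batch.
    single = len(events) == 1
    path = "events" if single else "events/batch"
    body = events[0] if single else events
    return urllib.request.Request(
        url=f"{base_url}/streams/{stream}/{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Sending is then one call:
# with urllib.request.urlopen(build_put_request(...)) as resp:
#     print(resp.status)
```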
66.–68. Hadoop Rest API – Architecture Evolution (Phase 1)
[Diagram: microservices in the AWS Cloud send events through a load balancer to the Generation 1 HRA service in the Data Center, which ingests them into the Datalake]
69.–70. Hadoop Rest API – Architecture Evolution (Phase 2)
[Diagram: same setup as Phase 1, with ingestion into the Datalake shown from both sides during the transition]
71.–73. Hadoop Rest API – Architecture Evolution (Phase 3)
[Diagram: microservices send events to the Generation 2 HRA service, now running as an Infinity Service in the AWS Cloud, which ingests directly into the Datalake; the Data Center is no longer in the path]
75. Self Service ETL Requirements
Data Landscape Manifesto Principle #5: Data consumers are responsible for the definition and visualization of metrics and for driving the implementation and maintenance of these metrics.
→ Data consumers get a lot of responsibility
→ The Data Platform must empower users to fulfill this responsibility
→ It must be easy to develop, test and maintain data processes
Multi-account setup: data processes must run in the data consumer’s account
→ No centrally run processing tool
→ An AWS managed service is preferred
80. DataWario
1. Client for AWS Data Pipeline
2. Transforms YAML into the JSON format understood by Data Pipeline
3. Integrates with the Scout24 account setup and sets sensible configuration defaults
4. Augments Data Pipeline with additional features
5. Makes testing and debugging faster
81. DataWario – Architecture
[Diagram: a client with DataWario in the Data Center uploads artifacts to an Artifact Bucket and posts the pipeline definition to AWS Data Pipeline in any Scout24 account; an EC2 instance polls Data Pipeline for tasks and runs them]
• Java application, distributed as a JAR
• Shell wrapper (dw) for easy usage from the command line
Deployment of a pipeline
1. User issues dw deploy
2. DataWario
1. uploads artifacts from the user's machine to S3
2. generates the JSON definition from YAML
3. creates the data pipeline
3. AWS Data Pipeline runs the defined steps
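A minimal sketch of the translation step: a friendly definition (as parsed from YAML into a dict) becomes the pipeline-object list that AWS Data Pipeline expects. The input field names ("schedule", "steps", "options") are invented for illustration; the deck does not show DataWario's real YAML schema.

```python
def to_pipeline_objects(definition: dict) -> list:
    # The "Default" object carries pipeline-wide settings such as the schedule.
    objects = [{
        "id": "Default",
        "name": "Default",
        "fields": [{"key": "scheduleType",
                    "stringValue": definition.get("schedule", "ondemand")}],
    }]
    # Each step in the friendly definition becomes one pipeline object whose
    # options are flattened into Data Pipeline's key/stringValue fields.
    for step in definition.get("steps", []):
        objects.append({
            "id": step["name"],
            "name": step["name"],
            "fields": [{"key": k, "stringValue": str(v)}
                       for k, v in step.get("options", {}).items()],
        })
    return objects
```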
84. Augmenting data pipeline
Source table:
ID  Make  Model  Price  Mileage
1   VW    Golf   22000  15000
2   BMW   320d   35000  40000
“Built-In” Copy Activity output:
1,VW,Golf,22000,15000
2,BMW,320d,35000,40000
• CSV (or TSV)
• No schema
85. Augmenting data pipeline
For the same source table, the “Built-In” Copy Activity produces schemaless CSV (or TSV), while the “Custom” Copy Activity carries the data together with its schema.
86. Speed up pipeline development
Typical development cycle: dw deploy → pipeline starts → runs → fails; each iteration takes ca. 10 min + job runtime.
87. Speed up test and debug cycle
Without test cluster: dw deploy → starts → runs → fails; each iteration takes ca. 10 min + job runtime.
With test cluster: create it once with dw test-cluster create; then dw deploy --test-cluster uses the running cluster → runs → fails; each iteration takes only the job runtime.
88. DataWario brought us from here …
Photo by Dirk van der Made (https://commons.wikimedia.org/wiki/File:DirkvdM_cloudforest-jungle.jpg), “DirkvdM cloudforest-jungle”, https://creativecommons.org/licenses/by-sa/3.0/legalcode
91. Paving the path further
• We built a lot of pipelines, and so did our users
• We see repetitive tasks emerge in these pipelines
• Reimplementing solutions over and over again is waste
• We don't like waste
→ Let's build reusable tools for common tasks
92. Rob Pike; Brian W. Kernighan (October 1984). "Program Design in the UNIX Environment"
Much of the power of the UNIX operating system comes from a style of program design that makes programs easy to use and, more important, easy to combine with other programs. This style has been called the use of software tools, and depends more on how the programs fit into the programming environment and how they can be used with other programs than on how they are designed internally. [...] This style was based on the use of tools: using programs separately or in combination to get a job done, rather than doing it by hand, by monolithic self-sufficient subsystems, or by special-purpose, one-time programs.
93. Tools that do one thing, and do that well
• Advantages:
− Easy to configure, only a few settings
− Easy to test, because there are fewer branches to cover
− Easy to debug, as the output of each tool can be checked independently
• Drawbacks:
− More intermediate data written
− Longer runtimes
94. Event snapshotter
• Typical questions: What was the headline of a listing on Feb 22, 2016? Was it published? How about March 1?
• Input data: Change events for an entity (e.g. listing)
− User changed headline of a listing, number of pictures
− Listing got published/unpublished
• Snapshotter
1. takes snapshot from the day before
2. merges all change events from the day, keeps last
3. writes new snapshot for day
• Simple config (input and output paths, names of the ID and timestamp columns, partition schema)
• No business logic, agnostic of type of entity
95. Event Snapshotter Example
Change events:
{"id": 1, "changed": "2019-01-01T15:31:00.000+01:00", "headline": "nice apartment", "published": false}
{"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice apartment", "published": false}
{"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice apartment", "published": true}
{"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice apartment", "published": true}
Snapshot for January 1, 2019 (s3://snapshotBucket/snapshot_day=2019-01-01/):
{"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice apartment", "published": false}
Snapshot for January 2, 2019 (s3://snapshotBucket/snapshot_day=2019-01-02/):
{"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice apartment", "published": true}
{"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice apartment", "published": true}
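The merge step behind this example can be sketched in plain Python (the actual snapshotter is a Spark job): start from the previous day's snapshot, apply the day's change events in timestamp order, and keep the last state per entity.

```python
def merge_snapshot(snapshot, events, id_col="id", ts_col="changed"):
    # Index yesterday's snapshot by entity ID.
    state = {row[id_col]: row for row in snapshot}
    # Apply change events in timestamp order, so the last event wins.
    for event in sorted(events, key=lambda e: e[ts_col]):
        state[event[id_col]] = event
    # The new snapshot: one row per entity.
    return sorted(state.values(), key=lambda r: r[id_col])
```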
96. Aggregator
• Typical Questions:
− How many published listings did we have on Feb 1, 2018?
− What's the daily average number of pageviews per car make?
• Input data:
− Interaction events of an entity (e.g. page viewed, email sent, number called)
− Output of the Event Snapshotter
• Simple config (input and output paths, timestamp column, grouping column, aggregation method and column)
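The aggregator's core, mirroring the "grouping column, aggregation method and column" config, can be sketched in plain Python (the real tool runs as a Spark job):

```python
from collections import defaultdict


def aggregate(rows, group_col, agg_col=None, method="count"):
    # Group rows by the grouping column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row.get(agg_col))
    # Apply the chosen aggregation per group.
    if method == "count":
        return {key: len(values) for key, values in groups.items()}
    if method == "avg":
        return {key: sum(values) / len(values) for key, values in groups.items()}
    raise ValueError(f"unsupported aggregation method: {method}")
```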
97. Spark Utils
• Collection of simple tools (usually one class per tool)
• Can be used
− standalone in a pipeline step
− as a dependency of your code
• Examples:
− Create a Hive table on top of existing data
− Convert data from one format to another (e.g. from JSON to Parquet for faster querying)
− Flatten nested structures
− Publish data quality metrics to CloudWatch
− Materialize views
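To illustrate the "flatten nested structures" idea with a minimal plain-Python sketch (the actual utility operates on Spark DataFrames): nested keys become dot-separated top-level columns.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    # Recursively hoist nested dict keys to the top level,
    # joining path segments with dots.
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}."))
        else:
            flat[name] = value
    return flat
```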
98. Current State / Outlook
• Data Pipeline with DataWario is widely used throughout the company
• Challenges:
− Data Pipeline is not getting much love from AWS (not deprecated, but not getting many new features)
− Coordination between data pipelines is only rudimentary
• Possible solutions:
− AWS Step Functions in combination with AWS Glue
− Introduction of a workflow tool (e.g. Airflow)
101.–104. What’s Ahead
Goals:
• Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets
• Data Analysis for Various User Groups
• Provide a Timely and Accurate Update of the Metadata Layer
Building blocks, introduced one per slide: OneScout Hive Metastore, Personal Analytics Cluster, Automatic Hive Partition Detection
105. OneScout Hive Metastore – A Schematic View
[Diagram: the Personal Analytics Cluster queries Hive Tables and Presto Views defined in the OneScout Hive Metastore on top of the Datalake]
106. OneScout Hive Metastore – Recap of Ecosystem
[Diagram: many Product Accounts grouped around the two Data Lake Accounts, ImmobilienScout24 and AutoScout24]
107.–113. The Scout24 Hive Metastore Proxy – A Motivation
Goal: one long-running metastore DB that every EMR cluster connects to.
Ideal situation: all EMR clusters connect directly to a shared Amazon Aurora metastore in the central Datalake account and read the data from there.
Solution for the multi-account setting: each EMR cluster in a Product Account connects to a local Proxy Instance, which forwards metastore requests to the Amazon Aurora database in a dedicated Metastore Account.
134.–136. The Personal Analytics Cluster – An Overview
[Diagram: a simple config is turned by AWS CloudFormation into a Personal Analytics Cluster on Amazon EMR]
• Easy Access via Web Interface
• Zeppelin and Jupyter Notebook Restore
• One-Click Deployment
• Managed Scaling and Shutdown
• Support for Pre-baked AMIs and Configs
137. The Personal Analytics Cluster – Configuration Setup
• Cluster Name (default is usually username)
• AWS Account Name
• Market segment
• Use Case
• Shutdown time
• Instance Type
• Bid Price
• Key Pair
• EBS Volume Size
• Tear Down Existing Cluster (optional)
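How the configuration fields above could be passed to the CloudFormation stack that creates the cluster; the parameter keys, values and stack-naming scheme are assumptions for illustration, not the actual PAC template.

```python
def build_stack_parameters(config: dict) -> list:
    # CloudFormation expects parameters as key/value pairs.
    return [{"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in config.items()]


pac_config = {  # hypothetical values
    "ClusterName": "jdoe",          # default is usually the username
    "AwsAccountName": "is24-data",
    "MarketSegment": "real-estate",
    "UseCase": "ad-hoc-analytics",
    "ShutdownTime": "20:00",
    "InstanceType": "m5.2xlarge",
    "BidPrice": "0.25",
    "KeyPair": "jdoe-key",
    "EbsVolumeSizeGb": 100,
}

# The actual launch would then be a single CloudFormation call, e.g.:
# boto3.client("cloudformation").create_stack(
#     StackName=f"pac-{pac_config['ClusterName']}",
#     TemplateURL="...",  # the PAC template
#     Parameters=build_stack_parameters(pac_config))
```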
140. The Personal Analytics Cluster – Data Lake Access
[Diagram: recap of the bucket types (Raw and Processed zones with Regular, Restricted and Personal visibility levels) and the access roles of the ImmobilienScout24 and AutoScout24 Data Lake Accounts]
141. The Personal Analytics Cluster – Data Lake Access
• Managed with an EMRFS security configuration on AWS EMR
• Integrates internally with the EMR applications: Spark, Presto, etc.
• Easy to define
• Maps a defined IAM role to the S3 bucket being accessed
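A sketch of what such a role-to-bucket mapping can look like as an EMR security configuration with EMRFS role mappings (each S3 prefix mapped to the IAM role EMRFS should assume when accessing it). Bucket names and role ARNs are placeholders.

```python
import json


def emrfs_security_configuration(prefix_to_role: dict) -> str:
    # One role mapping per bucket prefix.
    role_mappings = [
        {"Role": role_arn, "IdentifierType": "Prefix", "Identifiers": [prefix]}
        for prefix, role_arn in prefix_to_role.items()
    ]
    return json.dumps({
        "AuthorizationConfiguration": {
            "EmrFsConfiguration": {"RoleMappings": role_mappings}
        }
    })

# Example mapping for a regular data lake bucket:
# emrfs_security_configuration({
#     "s3://our-regular-datalake-bucket/":
#         "arn:aws:iam::111111111111:role/non-restricted-role"})
```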
142. The Personal Analytics Cluster – Data Lake Access
Scenario One:
Non-restricted Data Lake Access
143.–146. The Personal Analytics Cluster – Data Lake Access
my_data = spark.read.parquet("s3://our-regular-datalake-bucket/some-data")
[Diagram, built up over four slides: a PAC on Amazon EMR in a Segment Account issues the read above; the EMR-FS role-to-bucket mapping in emrfs-site.xml resolves the Regular bucket to the Non-restricted Role, which the cluster assumes via STS to read from the Data Lake Account]
147. The Personal Analytics Cluster – Data Lake Access
Scenario Two:
Restricted/Personal Data Lake Access
148.–151. The Personal Analytics Cluster – Data Lake Access
my_data = spark.read.parquet("s3://our-restricted-datalake-bucket/some-data")
[Diagram, built up over four slides: the same flow for a restricted bucket; the mapping resolves to the Restricted Role, assumed via STS. Clusters in Segment A and Segment B accounts each resolve and assume their own role for the restricted bucket]
159. Christian Dietze, Senior Data Engineer, Scout24 Group
Olalekan Elesin, Data Engineer, Scout24 Group
Raffael Dzikowski, Senior Data Engineer, Scout24 Group
Thank you for your attention!