The Scout24 Data Platform - a technical deep dive

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
SUMMIT
The Scout24 Data Platform
A Technical Deep Dive
Sean Gustafson
Senior Technical Product Manager
Scout24
Raffael Dzikowski
Senior Data Engineer
Scout24

5 Core Geographies and an overall presence in 18 countries
80m Household Reach
2 Major Household Brand Names
Scout24 AG
• SDAX
• € 489 million revenue (2017)
• ~1500 employees

Our technical evolution
[Diagram labels: Production Database, Monolith app, Data Warehouse; Microservice (14 shown), Persistence (7 shown)]

Our data warehouse was a bottleneck

Scout24 wants to become a truly data-driven company
Fast & easy data-driven
product development…
…supported by
Data & Analytics

Scout24 wants to become a truly data-driven company
Everywhere in the company... ...without bloating up Data &
Analytics

Our solution:
Build an internal “platform” for data

We think of our Data Platform as a Product
Just like AWS, Salesforce, etc. – the platform is a generic layer upon which
Scout24’s products can be built
BUT, we have a very, very small number of customers.
That means, product teams get personalized support and there is lots of
opportunity for collaboration.

We don’t dictate anything.
We just try to make certain things easier by offering a “paved path”.
Product teams are fully empowered to make their own choices about what is the best use of their resources.

“In almost all cases, we will not mandate that internal teams use these platforms and services; these platform teams will have to win over and satisfy their internal customers, even competing with external vendors.”

Guiding principle of the platform
Autonomy for producers and consumers
Self-service Analytics
Self-service Data Ingestion
Self-service ETL

[Diagram labels: Central Data Lake on Amazon S3; Self-service Data Ingestion (Data Producer); Self-service ETL; Self-service Analytics (Data Scientist, Analyst, PM, Leader, Engineer)]

Our Approach

Multi-Account Setting

Data Lake Bucket Types

Data Lake Access Roles
Data Lake Account
ImmobilienScout24
Data Lake Account
AutoScout24
Regular DL Bucket Access Role
Restricted DL Bucket Access Roles
Personal DL Bucket Access Roles

Ingestion Goals
• Microservices
• Batch Support
• Streaming Support
• REST API
• Scalability

Ingestion Options
• Amazon Kinesis Data Firehose (Simple Config)
• Kafka Connect (Simple Config)

Firehose Ingestion Architecture
[Diagram, built up over four slides.
Components: Producer Account, Data Producer, Data Lake Account, Amazon Kinesis Data Firehose, Producer to Firehose Role, Firehose to Datalake Role, Datalake Bucket.
Steps: STS Assume Role (twice), Send Data, Write Data.]
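
From the producer's side, the steps in the diagram (assume a role via STS, then send data to Firehose) could look roughly like the sketch below. This is an illustration under assumptions, not Scout24's code: the helper names, role ARN, and stream name are hypothetical, and the clients are passed in as parameters. The calls themselves (`assume_role` on an STS client, `put_record` on a Firehose client) are standard boto3 calls.

```python
import json


def firehose_client_for_role(sts_client, client_factory, role_arn,
                             session_name="data-producer"):
    """Assume a role via STS and build a Firehose client from its credentials.

    sts_client:     e.g. boto3.client("sts") in the producer account
    client_factory: e.g. boto3.client
    role_arn:       hypothetical ARN of the "Producer to Firehose" role
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )
    creds = response["Credentials"]  # temporary credentials for the role
    return client_factory(
        "firehose",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def send_event(firehose_client, stream_name, event):
    """Send one event to the delivery stream as newline-terminated JSON."""
    payload = (json.dumps(event) + "\n").encode("utf-8")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )
```

With boto3 installed, a producer would call `firehose_client_for_role(boto3.client("sts"), boto3.client, role_arn)` once and then `send_event(...)` per record. The final "Write Data" step needs no producer code: Firehose writes to the bucket using its own role.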

Kafka Connect
[Diagram, built up over five slides.
Systems: Elasticsearch, Amazon S3, ActiveMQ, Cassandra, Kafka, RDBMS…
The Kafka Connect Cluster reads data from topic(s) and writes data.]
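
To illustrate what a "simple config" for moving topic data into Amazon S3 might contain, here is a sketch that builds the JSON body for the Kafka Connect REST API (`POST /connectors`). It assumes the Confluent S3 sink connector is installed on the cluster; the deck does not say which connector Scout24 uses, and the connector name, topic, bucket, and region are made-up placeholders.

```python
import json


def s3_sink_connector(name, topics, bucket, region, flush_size=1000):
    """Build the request body for POST /connectors on Kafka Connect.

    Assumption: the Confluent S3 sink connector is available on the
    Connect cluster. All argument values used below are placeholders.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "tasks.max": "1",
            "topics": ",".join(topics),  # topic(s) to read from
            "s3.bucket.name": bucket,    # bucket to write to
            "s3.region": region,
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": str(flush_size),  # records per S3 object
        },
    }


body = s3_sink_connector(
    "example-s3-sink", ["example-topic"], "example-datalake-bucket", "eu-west-1"
)
print(json.dumps(body, indent=2))
```

Sending this body to the Connect cluster's REST endpoint creates the connector; the cluster then reads from the listed topic(s) and writes objects to the bucket.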

Scout24 Infinity Cluster

Simple Config
Infinity Service
Amazon ECS
• Central Logging to Elasticsearch
• Monitoring in Datadog
• Managed By Scout24 Cloud Platform Engineering

Related Breakouts
15:00 in Hall 1
To Infinity and Beyond – Handling Heterogeneous Container
Clusters in AWS
Christine Trahe, Platform Engineer @ Scout24
16:00 in Hall 1
Boost your AWS Infrastructure
Philipp Garbe, AWS Container Hero @ Scout24

Kafka Connect on Infinity
Simple Config (Infinity)
Amazon ECS
Infinity Service
Simple Config (Kafka Connect)
Kafka Connect Service

Kafka Connect Deployment

DataWario – Our Wrapper for AWS Data Pipeline

Simple Config
DataWario
AWS Data Pipeline
• Built-in Support for Scout24 Ecosystem
• Shortens Development Cycles
• Only Exposes Configuration Essentials
• Introduces Custom Step Types
• Automatically Manages Artifacts
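
The deck does not show DataWario's configuration format or internals, so the following is only a hypothetical illustration of the idea "only exposes configuration essentials": a short step list is expanded into the more verbose object list that AWS Data Pipeline expects. The input keys (`resource`, `steps`, `name`, `command`) and the function name are invented for this sketch. The output follows the `pipelineObjects` shape (objects with `id`, `name`, and `fields` entries using `stringValue` or `refValue`) used by the Data Pipeline API.

```python
def to_pipeline_objects(simple_config):
    """Expand a short, hypothetical step list into Data Pipeline objects.

    Each step becomes a ShellCommandActivity that runs on the configured
    resource and depends on the previous step. The resource object itself
    (and the schedule) would have to be defined too; omitted for brevity.
    """
    objects = []
    previous_id = None
    for step in simple_config["steps"]:
        fields = [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": step["command"]},
            {"key": "runsOn", "refValue": simple_config["resource"]},
        ]
        if previous_id is not None:
            # Chain the steps so they run in the order they are listed.
            fields.append({"key": "dependsOn", "refValue": previous_id})
        objects.append({"id": step["name"], "name": step["name"], "fields": fields})
        previous_id = step["name"]
    return objects


simple_config = {
    "resource": "ExampleCluster",  # placeholder resource id
    "steps": [
        {"name": "extract", "command": "python extract.py"},
        {"name": "transform", "command": "python transform.py"},
    ],
}
objects = to_pipeline_objects(simple_config)
```

Two short step entries turn into two objects with four and five fields each, which shows why a thin wrapper over the raw pipeline definition can shorten development cycles.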

DataWario Architecture

Library of Common Data Transformations

Query Challenges

What’s Ahead
• OneScout Hive Metastore: Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets
• Personal Analytics Cluster: Data Analysis for Various User Groups
• Automatic Hive Partition Detection: Provide a Timely and Accurate Update of the Metadata Layer

OneScout Hive Metastore – A Schematic View
[Diagram labels: Personal Analytics Cluster, Hive Tables and Presto Views, Datalake]

OneScout Hive Metastore – Recap of Ecosystem

EMR Metastore Configuration Options

The Scout24 Hive Metastore Proxy – A Motivation

The Personal Analytics Cluster – An Overview

Simple Config
Personal Analytics Cluster
Amazon EMR
• Easy Access via Web Interface
• Zeppelin and Jupyter Notebook Restore
• One-Click Deployment
• Managed Scaling and Shutdown
• Support for Pre-baked AMIs and Configs

Automated Partition Detection – A Motivation
[Diagram, built up over five slides.
Components: Personal Analytics Cluster, Datalake, Partitioned Table.
Steps: Data Ingestion, Table Access, Automatic Partition Detection.]
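
The deck does not show how partition detection is implemented. One common approach, sketched here under that caveat, is to derive Hive-style partitions (`key=value` path segments) from the object keys that ingestion has written and to register any new ones in the metastore with `ALTER TABLE … ADD IF NOT EXISTS PARTITION`. The function names, table name, bucket, and keys are hypothetical, and the sketch assumes every directory level under the table prefix is a partition column.

```python
import re

# One path segment of the form "column=value".
PARTITION_RE = re.compile(r"([^/=]+)=([^/]+)")


def partitions_from_keys(keys, table_prefix):
    """Return the sorted, de-duplicated partitions found under a prefix."""
    found = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue
        # Drop the prefix and the file name; keep the directory segments.
        segments = key[len(table_prefix):].split("/")[:-1]
        spec = []
        for segment in segments:
            match = PARTITION_RE.fullmatch(segment)
            if match:
                spec.append((match.group(1), match.group(2)))
        if spec:
            found.add(tuple(spec))
    return sorted(found)


def add_partition_statements(table, bucket, table_prefix, partitions):
    """Build one idempotent Hive DDL statement per partition."""
    statements = []
    for spec in partitions:
        columns = ", ".join(f"{k}='{v}'" for k, v in spec)
        path = "/".join(f"{k}={v}" for k, v in spec)
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({columns}) "
            f"LOCATION 's3://{bucket}/{table_prefix}{path}/'"
        )
    return statements


keys = [
    "events/year=2019/month=02/part-0.json",
    "events/year=2019/month=02/part-1.json",
    "events/year=2019/month=03/part-0.json",
    "other/file.json",
]
partitions = partitions_from_keys(keys, "events/")
statements = add_partition_statements(
    "events", "example-datalake", "events/", partitions
)
```

Because the statements use `IF NOT EXISTS`, they can be re-run after every ingestion without failing on partitions that are already registered.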

Partition Detection Architecture

Build our own vs. AWS managed services
Metastore → Glue
Presto → Athena
DataWario → Glue, Step Functions, Lambda, …
Personal Analytics Cluster → Glue notebooks, SageMaker
We hope to throw out most of the custom components we built.

Our data platform holds nothing back

Extra slides

Data Warehouse vs. Data Platform
Data Warehouse | Data Platform
Centralized | Federated
Control | Autonomy
Perfection | Scale
Pull | Push
Product is Data | Product is Platform
Reporting | Reporting, Advanced Analytics, Machine Learning, etc.

[Architecture diagram. Labels: Core DB, CRM, APP (several), Micro Service, Data Producer, REST API / Firehose, Central Data Lake on S3, Metastore, DataWario, Spark, Presto, SQL, MicroStrategy, Alation Data Catalog, Personal Analytics Clusters, Jupyter, Zeppelin, Data Scientist, Analyst, PM, Leader, Engineer]

Our Journey to Presto
[Diagram labels: Personal Analytics Cluster, Datalake]
Amazon Athena
On EMR
• Cross Account Support (OneScout Hive Metastore)
• Leverages Datalake Access Roles (EMRFS)
• Scheduled Scaling Configurations
• Fits our GDPR Concept (multiple isolated Clusters)

SCOUT24
DATA LANDSCAPE
MANIFESTO
ROLES, RESPONSIBILITIES, AND VALUES
FOR A DATA-DRIVEN COMPANY AT SCALE

#1 Preamble
Data is a key asset of our company.

#2 Our Responsibility
We, Data & Analytics, are responsible for providing a solid Data Platform as well as clear guidelines and training on how to participate in the Data Landscape.
[Diagram labels: Data Platform, DnA, Data Landscape]

#3 Data Autonomy, Not Anarchy
Data autonomy puts data producers & data consumers in control of their data & of their metrics and thereby allows us to be data-driven at scale, but this comes with responsibility.
[Diagram labels: Data Platform, Data, Producer, Consumer, DnA, Data Landscape]

#4 Producer’s Responsibility
Data producers are responsible for publishing data to the central Data Lake, for the data's quality, and for publishing metadata that makes it easy to find and consume the data.
[Diagram labels: Data Platform, Metadata, Data, Producer, DnA, Data Landscape]

#5 Consumer’s Responsibility
Data consumers are responsible for the definition & visualization of metrics and for driving the implementation and maintenance of these metrics.
[Diagram labels: Data Platform, Producer, Consumer, DnA, Data Landscape]

#6 Exception: Core KPIs
We, Data & Analytics, take full ownership of and responsibility for the few top company-wide core KPIs.
[Diagram labels: Data Platform, Producer, Consumer, DnA, Data Landscape, Core metric]

#7 Transparency Over Continuity
We value data transparency over data continuity, which means we may break metric comparability if doing so enables better insights.
[Diagram labels: Data Platform, Producer, Consumer, DnA, Data Landscape, Core metric]

The Ultimate Goal
A federated landscape of data producers and consumers with just enough rules to ensure seamless co-operation without severely impeding autonomy.
[Diagram labels: Data Platform, Metadata, Data, Producer, Consumer, DnA, Data Landscape, Core metric]
