www.scout24.com
The Scout24 Data Platform
A Technical Deep Dive
Munich | March 20, 2019 |Christian Dietze, Olalekan Elesin, Raffael Dzikowski
5 Core Geographies and an overall presence in 18 countries
80m Household Reach
2 Major Household Brand Names
Scout24 AG
• MDAX
• € 531.7 million revenue (2018)
Our technical evolution
[Diagram: from a monolith app with one production database and a data warehouse to many microservices, each with its own persistence, next to the data warehouse]
Our data warehouse was a bottleneck
Scout24 wants to become a truly data-driven company
Fast & easy data-driven product development… …supported by Data & Analytics
Everywhere in the company... ...without bloating up Data & Analytics
Our solution:
Build an internal “platform” for data
What is the Scout24 Data “Platform”?
We think of our Data Platform as a Product
• Just like AWS, Salesforce, etc. – the platform is a generic layer upon
which Scout24’s products can be built
• BUT, we have a very, very small number of customers.
• That means, product teams get personalized support and there is lots
of opportunity for collaboration.
We don’t dictate anything.
We just try to make certain things easier by offering a “paved path”
Product teams are fully empowered to make their own choices about
what is the best use of their resources.
“In almost all cases, we will
not mandate that internal
teams use these platforms
and services—these
platform teams will have
to win over and satisfy
their internal customers,
sometimes even
competing with external
vendors.”
Guiding principle of the platform
• Autonomy for producers and consumers
Self-service Analytics
Self-service Data Ingestion
Self-service ETL
Data Landscape Manifesto
Data Landscape Manifesto – Purpose
• Best Practices
• Guidelines
• Alignment
• Principles
• Responsibilities
Watch the Video | Read the Blog Post on Medium
Necessary Cultural Changes during Migration
from Data Warehouse to Cloud Data Platform
Dr. Sean Gustafson, Scout24 AG
Data Festival 2018
10,000 Foot Architecture Overview
[Diagram areas: AWS Cloud, Data Center. Components: Microservices, Firehose / Kafka / Rest API, Datalake, DataWario, OneScout Hive Metastore, Amazon Aurora, Personal Analytics Cluster, Core DB. Users shown: Engineer, Data Scientist / Analyst, PM / Leader / Analyst]
Section #1: Self-Service Ingestion (Data Producer / Engineer)
Section #2: Self-Service ETL (Engineer)
Section #3: Self-Service Analytics (PM / Leader / Analyst)
Self-Service Ingestion
Multi-Account Setting
[Diagram: many product accounts grouped around two data lake accounts, one for ImmobilienScout24 and one for AutoScout24]
Data Lake Bucket Types
Visibility level: Regular, Restricted, Personal
Each visibility level exists for Raw and for Processed data
Data Lake Access Roles
Data Lake Account
ImmobilienScout24
Data Lake Account
AutoScout24
Regular DL Bucket Access Role
Restricted DL Bucket Access Roles
Personal DL Bucket Access Roles
Data Lake Ingestion Goals
Ingestion goals for microservices:
• Batch Support
• Streaming Support
• Rest API
• Scalability
Ingestion Options
• Amazon Kinesis Data Firehose: simple config
• Kafka Connect: simple config
• Hadoop Rest API: configuration free
Kinesis Data Firehose Ingestion Mechanism
Firehose Ingestion Architecture
[Diagram components: Data Producer in the Producer Account; Producer to Firehose Role, Amazon Kinesis Data Firehose, Firehose to Datalake Role and Datalake Bucket in the Data Lake Account]
Labelled steps, in build order:
1. STS Assume Role (Producer to Firehose Role)
2. STS Assume Role (Firehose to Datalake Role)
3. Send Data (producer to Firehose)
4. Write Data (Firehose to the datalake bucket)
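The producer side of this flow can be sketched in a few lines of Python. This is a minimal sketch, not the setup used at Scout24: it assumes boto3 on the producer, and the role ARN and delivery stream name in main() are hypothetical placeholders. The Firehose to Datalake Role is assumed by Firehose itself and needs no producer code.

```python
import json


def assume_role_credentials(sts_client, role_arn, session_name="datalake-ingestion"):
    """Assume an IAM role via STS and return keyword arguments for boto3.client()."""
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def send_event(firehose_client, stream_name, event):
    """Send one event as newline-delimited JSON to a Firehose delivery stream."""
    payload = (json.dumps(event) + "\n").encode("utf-8")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )


def main():
    # boto3 is imported lazily so the helpers above can be tried with stub clients.
    import boto3

    # Hypothetical ARN and stream name (assumptions, not taken from the deck).
    role_arn = "arn:aws:iam::111111111111:role/producer-to-firehose"
    creds = assume_role_credentials(boto3.client("sts"), role_arn)
    firehose = boto3.client("firehose", **creds)
    send_event(firehose, "example-delivery-stream", {"id": 1, "published": True})
```

Calling main() from a producer service would perform the two producer-side steps of the diagram: STS Assume Role, then Send Data.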
Kafka Connect Ingestion Mechanism
Kafka Connect
[Diagram: a Kafka Connect cluster between Kafka and external systems (Elasticsearch, Amazon S3, ActiveMQ, Cassandra, RDBMS, …)]
• Read Data from Topic(s)
• Write Data
Scout24 Infinity Cluster
Infinity Service on Amazon ECS
• Simple Config
• Central Logging to Elasticsearch
• Monitoring in Datadog
• Managed by Scout24 Cloud Platform Engineering
Get the Slide Deck
To Infinity and Beyond Handling
Heterogenous Container Clusters in AWS
Christine Trahe, Scout24 AG
AWS Summit Berlin 2019
Kafka Connect on Infinity
• Simple Config (Infinity): Infinity Service on Amazon ECS
• Simple Config (Kafka Connect): Kafka Connect Service, running as an Infinity Service
Kafka Connect Deployment
[Diagram: of the possible Kafka Connect targets, only Amazon S3 is used; the Kafka Connect Service reads data from topic(s) and writes data to Amazon S3]
Kafka Connect Deployment
[Diagram components: Kafka Cluster on EC2 Instances in the Kafka Account; Kafka Connect Container in the Kafka Connect Service (on Infinity, Amazon ECS) in the Infinity Account; Kafka Connect Ingestion Role and Datalake Bucket in the Datalake Account]
Labelled steps, in build order:
1. STS Assume Role (Kafka Connect Ingestion Role)
2. Read Topic
3. Write Topic
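The deck does not show what the "simple config" for a Kafka Connect service looks like. As an illustration only, the sketch below builds the JSON body that the Kafka Connect REST API accepts on POST /connectors, assuming the Confluent S3 sink connector is the one in use (the deck does not name the connector). The connector name, topic and bucket are hypothetical.

```python
import json


def s3_sink_connector(name, topics, bucket, region, flush_size=1000):
    """Build the request body for POST /connectors on the Kafka Connect REST API."""
    return {
        "name": name,
        "config": {
            # Assumption: Confluent's S3 sink connector.
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "tasks.max": "1",
            "topics": ",".join(topics),
            "s3.bucket.name": bucket,
            "s3.region": region,
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            # Number of records written per S3 object.
            "flush.size": str(flush_size),
        },
    }


# Hypothetical values for illustration.
body = s3_sink_connector(
    name="listing-events-to-datalake",
    topics=["listing-events"],
    bucket="example-datalake-raw-bucket",
    region="eu-west-1",
)
print(json.dumps(body, indent=2))
```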
Hadoop Rest API Ingestion Mechanism
Hadoop Rest API – A Motivation
Hadoop Rest API
• Unified set of ingestion operations
• Performs event validation
• Adds reception timestamp field
• Stable user-facing endpoints, varying backend
Hadoop Rest API – Operation Overview
Hadoop Rest API
• PUT: Add Event Batch
• PUT: Add Single Event
• Automatic partitioning by ingestion date
• Persisting to disk initially, periodically flushing to target store
• Also publishing to Kinesis stream
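What validation, the reception timestamp and date partitioning could look like for a single event is sketched below. It is a hedged illustration, not the Hadoop Rest API code: the field name reception_timestamp, the partition name ingestion_date and the rule "an event must be a JSON object" are assumptions.

```python
import json
from datetime import datetime, timezone


class InvalidEvent(ValueError):
    """Raised when an incoming event fails validation."""


def accept_event(raw_body, now=None):
    """Validate one event, stamp it, and pick its ingestion-date partition."""
    now = now or datetime.now(timezone.utc)
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise InvalidEvent("not valid JSON: {}".format(exc)) from exc
    if not isinstance(event, dict):
        raise InvalidEvent("event must be a JSON object")
    # Hypothetical field name for the reception timestamp.
    event["reception_timestamp"] = now.isoformat()
    # Hypothetical partition naming, derived from the ingestion date.
    partition = "ingestion_date={}".format(now.date().isoformat())
    return partition, event
```

A batch operation could call accept_event once per element and reject or skip the invalid ones, depending on the contract offered to producers.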
Hadoop Rest API – Architecture Evolution (Phase 1)
[Diagram areas: AWS Cloud, Data Center. Components: Microservice, Load Balancer, HRA Service Generation 1, Datalake. Labelled steps: Send Events, Ingest]
Hadoop Rest API – Architecture Evolution (Phase 2)
[Diagram: same components as Phase 1. The first build shows Send Events and two Ingest steps, the second build Send Events and one Ingest step]
Hadoop Rest API – Architecture Evolution (Phase 3)
[Diagram areas: AWS Cloud, Data Center. Components: Microservice, HRA Service Generation 2 (an Infinity Service), Datalake; the load balancer no longer appears. Labelled steps: Send Events, Ingest]
Self-Service ETL
Self-Service ETL Requirements
Data Landscape Manifesto Principle #5: Data consumers are responsible for the definition and visualization of metrics and for driving the implementation and maintenance of these metrics.
→ Data consumers get a lot of responsibility
→ The Data Platform must empower users to fulfill this responsibility
→ It must be easy to develop, test and maintain data processes
Multi-account setup: Data processes must run in the data consumer’s account
→ No centrally run processing tool
→ An AWS managed service is preferred
AWS Data Pipeline
[Diagram: Extract, Transform, Load running in the team account]
ETL runs in team account
Teams run ETL in their accounts. It’s easy … or is it?
DataWario
to the rescue
DataWario
1. Client for AWS Data Pipeline
2. Transforms YAML into JSON format understood by Data Pipeline
3. Integrates with the Scout24 account setup and sets sensible configuration defaults
4. Augments Data Pipeline with additional features
5. Makes testing and debugging faster
DataWario – Architecture
[Diagram areas: Data Center, Any Scout24 Account. Components: Client with DataWario, Server, Artifact Bucket, EC2 Instance. Labelled steps: Upload Artifacts, Post Pipeline Definition, Start, Poll for Tasks]
• Java Application, distributed as a JAR
• Shell wrapper (dw) for easy usage from the command line
Deployment of a pipeline
1. User issues dw deploy
2. DataWario
   a. uploads artifacts from the user’s machine to S3
   b. generates the JSON definition from YAML
   c. creates the data pipeline
3. AWS Data Pipeline runs the defined steps
Sensible defaults
• EMR release? emr-5.21.0
• EC2 instance type? m5.xlarge
• IAM role? DataWarioResourceRole
• Subnet ID? subnet-12345
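The idea of merging a short user config with account-wide defaults and emitting a Data Pipeline definition can be sketched as follows. This is not DataWario's actual format: the config keys (emr, steps, name, command) are hypothetical, and a plain dict stands in for the parsed YAML. The object fields used (releaseLabel, masterInstanceType, coreInstanceType, resourceRole, subnetId, runsOn, step) are regular fields of the Data Pipeline EmrCluster and EmrActivity objects.

```python
import json

# Defaults taken from the slide above; in practice they would come from the account setup.
DEFAULTS = {
    "releaseLabel": "emr-5.21.0",
    "masterInstanceType": "m5.xlarge",
    "coreInstanceType": "m5.xlarge",
    "resourceRole": "DataWarioResourceRole",
    "subnetId": "subnet-12345",
}


def to_pipeline_definition(config):
    """Turn a small user config (hypothetical shape) into a pipeline definition JSON string."""
    cluster = {"id": "Cluster", "name": "Cluster", "type": "EmrCluster"}
    cluster.update(DEFAULTS)
    cluster.update(config.get("emr", {}))  # user settings override the defaults
    objects = [cluster]
    for step in config["steps"]:
        objects.append({
            "id": step["name"],
            "name": step["name"],
            "type": "EmrActivity",
            "runsOn": {"ref": "Cluster"},
            "step": step["command"],
        })
    return json.dumps({"objects": objects}, indent=2)


# Hypothetical user config: only what differs from the defaults is spelled out.
config = {
    "emr": {"coreInstanceType": "m5.2xlarge"},
    "steps": [
        {"name": "Snapshot", "command": "command-runner.jar,spark-submit,s3://example-artifacts/job.jar"},
    ],
}
print(to_pipeline_definition(config))
```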
Augmenting data pipeline
ID   Make   Model   Price   Mileage
1    VW     Golf    22000   15000
2    BMW    320d    35000   40000
“Built-In” Copy Activity
• CSV (or TSV)
• No schema
1,VW,Golf,22000,15000
2,BMW,320d,35000,40000
“Custom” Copy Activity
Speed up pipeline development
Typical development cycle: dw deploy, pipeline starts, runs, fails
• takes ca. 10 min + job runtime
Speed up test and debug cycle
Without test cluster: dw deploy, pipeline starts, runs, fails
• takes ca. 10 min + job runtime
With test cluster: dw test-cluster create, then dw deploy --test-cluster uses the test cluster, job runs, fails
• takes job runtime
DataWario brought us from here …
photo by Dirk van der Made (https://commons.wikimedia.org/wiki/File:DirkvdM_cloudforest-jungle.jpg), “DirkvdM cloudforest-jungle”, https://creativecommons.org/licenses/by-sa/3.0/legalcode
to here
So, how do we get here?
Paving the path further
• We built a lot of pipelines, so did our users
• We see repetitive tasks emerge in these pipelines
• Reimplementing solutions over and over again is waste
• We don’t like waste
→ Let’s build reusable tools for common tasks
Rob Pike; Brian W. Kernighan (October 1984). "Program Design in the UNIX Environment"
Much of the power of the UNIX operating system comes from a style of
program design that makes programs easy to use and, more important, easy
to combine with other programs. This style has been called the use of
software tools, and depends more on how the programs fit into the
programming environment and how they can be used with other programs
than on how they are designed internally. [...] This style was based on the use
of tools: using programs separately or in combination to get a job done,
rather than doing it by hand, by monolithic self-sufficient subsystems, or by
special-purpose, one-time programs.
Tools that do one thing, and do that well
• Advantages:
− Easy to configure, only a few settings
− Easy to test, because there are fewer branches to cover
− Easy to debug, because the output of each tool can be checked independently
• Drawbacks:
− More intermediate data written
− Longer runtimes
Event snapshotter
• Typical questions: What was the headline of a listing on Feb 22, 2016? Was it published? How about March 1?
• Input data: Change events for an entity (e.g. listing)
− User changed headline of a listing, number of pictures
− Listing got published/unpublished
• Snapshotter
1. takes snapshot from the day before
2. merges all change events from the day, keeping the last one per entity
3. writes new snapshot for the day
• Simple config (input and output path, name of ID and timestamp column, partition schema)
• No business logic, agnostic of type of entity
Event Snapshotter Example
Change events (input):
{"id": 1, "changed": "2019-01-01T15:31:00.000+01:00", "headline": "nice apartment", "published": false}
{"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice apartment", "published": false}
{"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice apartment", "published": true}
{"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice apartment", "published": true}
Resulting snapshots:
January 1, 2019 (s3://snapshotBucket/snapshot_day=2019-01-01/):
{"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice apartment", "published": false}
January 2, 2019 (s3://snapshotBucket/snapshot_day=2019-01-02/):
{"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice apartment", "published": true}
{"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice apartment", "published": true}
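The merge logic of the example can be reproduced in plain Python. This is a small in-memory sketch of the three snapshotter steps, not the Spark job itself; the function name build_snapshot is hypothetical, and the column names follow the example.

```python
from datetime import datetime


def build_snapshot(previous_snapshot, change_events, id_column="id", ts_column="changed"):
    """Merge one day's change events into the previous snapshot, keeping the last event per ID."""
    latest = {row[id_column]: row for row in previous_snapshot}
    # Apply events in timestamp order so the last change of the day wins.
    for event in sorted(change_events, key=lambda e: datetime.fromisoformat(e[ts_column])):
        latest[event[id_column]] = event
    return sorted(latest.values(), key=lambda row: row[id_column])


events_jan1 = [
    {"id": 1, "changed": "2019-01-01T15:31:00.000+01:00", "headline": "nice apartment", "published": False},
    {"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice apartment", "published": False},
]
events_jan2 = [
    {"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice apartment", "published": True},
    {"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice apartment", "published": True},
]

snapshot_jan1 = build_snapshot([], events_jan1)
snapshot_jan2 = build_snapshot(snapshot_jan1, events_jan2)
```

The function never looks at headline or published, which is what keeps the tool free of business logic and agnostic of the entity type.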
Aggregator
• Typical Questions:
− How many published listings did we have on Feb 1 2018?
− What‘s the daily average number of pageviews per car make?
• Input data:
− Interaction events of an entity (e.g. page viewed, email sent, number called)
− Output of Event Snapshotter
• Simple config (input and output paths, timestamp column, grouping column, aggregation method and column)
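A config-driven aggregation of this kind can be sketched without Spark. The function below is an illustrative stand-in, not the Aggregator's implementation; the method names count, sum and avg, the parameter names and the sample rows are assumptions.

```python
from collections import defaultdict


def aggregate(rows, group_by, column, method="count"):
    """Group rows by one column and aggregate another column with the given method."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[column])
    if method == "count":
        return {key: len(values) for key, values in groups.items()}
    if method == "sum":
        return {key: sum(values) for key, values in groups.items()}
    if method == "avg":
        return {key: sum(values) / len(values) for key, values in groups.items()}
    raise ValueError("unknown aggregation method: {}".format(method))


# Made-up daily pageview counts per car make.
daily_pageviews = [
    {"day": "2018-02-01", "make": "VW", "pageviews": 10},
    {"day": "2018-02-02", "make": "VW", "pageviews": 20},
    {"day": "2018-02-01", "make": "BMW", "pageviews": 30},
]
average_per_make = aggregate(daily_pageviews, group_by="make", column="pageviews", method="avg")
```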
Spark Utils
• Collection of simple tools (usually one class per tool)
• Can be used
− standalone in a pipeline step
− as a dependency of your code
• Examples:
− Create a Hive table on top of existing data
− Convert data from one format to another (e.g. from JSON to Parquet for faster querying)
− Flatten nested structures
− Publish data quality metrics to CloudWatch
− Materialize views
Current State / Outlook
• Data Pipeline with DataWario widely used throughout the company
• Challenges:
− Data Pipeline not getting much love from AWS (not deprecated, but not getting many new features)
− Coordination between data pipelines only rudimentary
• Possible solutions:
− AWS Step Functions in combination with AWS Glue
− Introduction of a workflow tool (e.g. Airflow)
Self-Service Analytics
Query Challenges
What’s Ahead
• OneScout Hive Metastore: Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets
• Personal Analytics Cluster: Data Analysis for Various User Groups
• Automatic Hive Partition Detection: Provide a Timely and Accurate Update of the Metadata Layer
OneScout Hive Metastore – A Schematic View
Personal Analytics
Cluster
Hive Tables and Presto Views
Datalake
OneScout Hive Metastore – Recap of Ecosystem
[Diagram: many product accounts grouped around two data lake accounts, one for ImmobilienScout24 and one for AutoScout24]
The Scout24 Hive Metastore Proxy – A Motivation
Goal: One long-running metastore DB every EMR connects to
Ideal Situation
[Diagram: in a central datalake account, several Amazon EMR clusters connect to one Amazon Aurora database and read data]
Solution for Multi-Account-Setting
[Diagram: Amazon EMR clusters in the product accounts connect to a proxy instance; each proxy instance forwards the request to Amazon Aurora in the metastore account]
Automated Partition Detection – A Motivation
[Diagram components: Datalake with a partitioned table, Hive Tables and Presto Views, Personal Analytics Cluster. Labelled steps, in build order: Data Ingestion, Table Access, Automatic Partition Detection]
Partition Detection Architecture
[Diagram areas: Datalake Account; Metastore Account with VPC (in eu-west-1) and Subnet. Components: Datalake, AWS CloudTrail, Amazon CloudWatch, Filter Bucket Notifications, Partition Detection Events Topic, Aggregate Events by Key, Partition Detector, Amazon Aurora, Metadata Item Creator, Create Table Events Topic, Table Trigger, Partition Info Table]
Labelled steps, in build order:
1. Trigger, Publish (shown in an early build only)
2. Data Events
3. Transfer to CloudWatch Logs
4. Apply Rule
5. Send to Partition Detector
6. Get Bucket Notification
7. Add Partition
8. Get Partition Info
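The final "Add Partition" step can be illustrated with a short sketch that derives a Hive partition from the key of a newly written S3 object. It assumes Hive-style name=value folders in the key. The table name and the function are hypothetical, and how the real Partition Detector resolves tables and issues the statement is not shown in the deck.

```python
def add_partition_statement(table, bucket, key):
    """Build a Hive ADD PARTITION statement from the S3 key of a newly written object."""
    folders = key.split("/")[:-1]  # drop the file name
    # Keep only Hive-style partition folders such as snapshot_day=2019-01-02.
    spec = [tuple(folder.split("=", 1)) for folder in folders if "=" in folder]
    if not spec:
        return None  # object is not inside a partition folder
    columns = ", ".join("{}='{}'".format(name, value) for name, value in spec)
    location = "s3://{}/{}/".format(bucket, "/".join(folders))
    # Values are inserted verbatim; a real implementation would have to escape them.
    return "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({}) LOCATION '{}'".format(
        table, columns, location
    )


statement = add_partition_statement(
    "listing_snapshots",  # hypothetical table name
    "snapshotBucket",
    "listings/snapshot_day=2019-01-02/part-0000.json",
)
print(statement)
```

Because the statement uses IF NOT EXISTS, it stays harmless when several objects land in the same partition, which fits the "Aggregate Events by Key" component in the diagram.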
Personal Analytics Cluster
The Personal Analytics Cluster – An Overview
[Diagram: a simple config is deployed through AWS CloudFormation as an Amazon EMR cluster, the Personal Analytics Cluster]
• Easy Access via Web Interface
• Zeppelin and Jupyter Notebook Restore
• OneClick Deployment
• Managed Scaling and Shutdown
• Support for Pre-baked AMIs and Configs
The Personal Analytics Cluster – Configuration Setup
• Cluster Name (default is usually username)
• AWS Account Name
• Market segment
• Use Case
• Shutdown time
• Instance Type
• Bid Price
• Key Pair
• EBS Volume Size
• Tear Down Existing Cluster (optional)
$ ./deploy-cluster --cluster-name <> \
    --account-name <> \
    --market-segment <> \
    --usecase <> \
    --shutdown-time <> \
    --instance-type <> \
    --bid-price <> \
    --key-pair <> \
    --ebs-volume-size <> \
    --teardown-existing <>
The Personal Analytics Cluster – Data Lake Access
Visibility level: Regular, Restricted, Personal
Each visibility level exists for Raw and for Processed data
Data Lake Account
ImmobilienScout24
Data Lake Account
AutoScout24
Regular DL Bucket Access Role
Restricted DL Bucket Access Roles
Personal DL Bucket Access Roles
The Personal Analytics Cluster – Data Lake Access
• Managed with an Amazon EMR security configuration for EMRFS
• Integrates with EMR applications internally (Spark, Presto, etc.)
• Easy to define
• Maps a defined IAM role to the S3 bucket being accessed
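A role-to-bucket mapping of this kind can be written as an EMR security configuration with EMRFS role mappings. The sketch below builds that JSON structure in Python. The bucket names follow the snippets in the deck, while the account ID and role names are hypothetical. The helper role_for_path only imitates the prefix lookup for illustration; the real resolution happens inside EMRFS.

```python
import json


def emrfs_security_configuration(bucket_roles):
    """Build an EMR security configuration that maps S3 bucket prefixes to IAM roles."""
    mappings = [
        {"Role": role_arn, "IdentifierType": "Prefix", "Identifiers": ["s3://{}/".format(bucket)]}
        for bucket, role_arn in bucket_roles.items()
    ]
    return {"AuthorizationConfiguration": {"EmrFsConfiguration": {"RoleMappings": mappings}}}


def role_for_path(configuration, s3_path):
    """Illustrative lookup only: return the role mapped to the prefix of s3_path, if any."""
    mappings = configuration["AuthorizationConfiguration"]["EmrFsConfiguration"]["RoleMappings"]
    for mapping in mappings:
        if mapping["IdentifierType"] == "Prefix" and any(
            s3_path.startswith(prefix) for prefix in mapping["Identifiers"]
        ):
            return mapping["Role"]
    return None


# Hypothetical account ID and role names.
configuration = emrfs_security_configuration({
    "our-regular-datalake-bucket": "arn:aws:iam::222222222222:role/regular-dl-bucket-access",
    "our-restricted-datalake-bucket": "arn:aws:iam::222222222222:role/restricted-dl-bucket-access",
})
print(json.dumps(configuration, indent=2))
```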
The Personal Analytics Cluster – Data Lake Access
Scenario One:
Non-restricted Data Lake Access
The Personal Analytics Cluster – Data Lake Access
[Diagram components: PAC (Amazon EMR) with emrfs-site.xml in the Segment Account; Non-restricted Role, Restricted Role and Personal Role; Regular, Restricted and Personal buckets in the Data Lake Account]
my_data = spark.read.parquet("s3://our-regular-datalake-bucket/some-data")
Labelled steps, in build order:
1. EMR-FS Role-to-Bucket Mapping
2. STS Assume Role
The Personal Analytics Cluster – Data Lake Access
Scenario Two:
Restricted/Personal Data Lake Access
The Personal Analytics Cluster – Data Lake Access
[Diagram components: PAC (Amazon EMR) with emrfs-site.xml in the Segment Account; Restricted Role, Non-restricted Role and Personal Role; Regular, Restricted and Personal buckets in the Data Lake Account. The last two builds show the same request from a PAC in Segment A Account and from a PAC in Segment B Account]
my_data = spark.read.parquet("s3://our-restricted-datalake-bucket/some-data")
Labelled steps, in build order:
1. EMR-FS Role-to-Bucket Mapping
2. STS Assume Role
Data Platform SQL Access
Our Journey to Presto
• On EMR
• Cross Account Support (OneScout Hive Metastore)
• Leverages Datalake Access Roles (EMRFS)
• Scheduled Scaling Configurations
• Fits our GDPR Concept (multiple isolated Clusters)
Average Presto Queries per Day
360 Queries per Day
[Chart: Number of Queries]
We hope to throw out most of the custom components we built.
Our data warehouse was a bottleneck
Our data platform holds nothing back
Christian Dietze, Senior Data Engineer, Scout24 Group
Olalekan Elesin, Data Engineer, Scout24 Group
Raffael Dzikowski, Senior Data Engineer, Scout24 Group
Thank you for your
attention!

Weitere ähnliche Inhalte

Was ist angesagt?

Considerations for your Cloud Journey
Considerations for your Cloud JourneyConsiderations for your Cloud Journey
Considerations for your Cloud JourneyAmazon Web Services
 
Iterating Towards a Cloud-Enabled IT Organization (ENT204-R2) - AWS re:Invent...
Iterating Towards a Cloud-Enabled IT Organization (ENT204-R2) - AWS re:Invent...Iterating Towards a Cloud-Enabled IT Organization (ENT204-R2) - AWS re:Invent...
Iterating Towards a Cloud-Enabled IT Organization (ENT204-R2) - AWS re:Invent...Amazon Web Services
 
Making a cloud first strategy a practical reality
Making a cloud first strategy a practical realityMaking a cloud first strategy a practical reality
Making a cloud first strategy a practical realityAmazon Web Services
 
AWS Application Discovery Service

The Scout24 Data Platform (A Technical Deep Dive)

  • 1. www.scout24.com The Scout24 Data Platform A Technical Deep Dive Munich | March 20, 2019 |Christian Dietze, Olalekan Elesin, Raffael Dzikowski
  • 2. 5 Core Geographies and an overall presence in 18 countries 80m Household Reach 2 Major Household Brand Names Scout24 AG • MDAX • € 531.7 million revenue (2018)
  • 3. Our technical evolution (diagram labels): Production Database, Monolith app, Data Warehouse; many Microservices, each with its own Persistence, and the Data Warehouse
  • 4. Our data warehouse was a bottleneck
  • 5. Scout24 wants to become a truly data-driven company Fast & easy data-driven product development… …supported by Data & Analytics
  • 6. Scout24 wants to become a truly data-driven company Everywhere in the company... ...without bloating up Data & Analytics
  • 7. Our solution: Build an internal “platform” for data
  • 8. What is the Scout24 Data “Platform”?
  • 9. We think of our Data Platform as a Product • Just like AWS, Salesforce, etc. – the platform is a generic layer upon which Scout24’s products can be built • BUT, we have a very, very small number of customers. • That means, product teams get personalized support and there is lots of opportunity for collaboration.
  • 10. We don’t dictate anything. We just try to make certain things easier by offering a “paved path” Product teams are fully empowered to make their own choices about what is the best use of their resources.
  • 11. “In almost all cases, we will not mandate that internal team use these platforms and services—these platform teams will have to win over and satisfy their internal customers, sometimes even competing with external vendors.”
  • 12. Guiding principle of the platform • Autonomy for producers and consumers Self-service Analytics Self-service Data Ingestion Self-service ETL
  • 22. Data Landscape Manifesto – Purpose Best Practices Guidelines Alignment Principles Responsibilities
  • 24. Watch the Video | Read the Blog Post on Medium Necessary Cultural Changes during Migration from Data Warehouse to Cloud Data Platform Dr. Sean Gustafson, Scout24 AG Data Festival 2018
  • 25. 10,000 Foot Architecture Overview Personal Analytics Cluster Datalake AWS Cloud Amazon Aurora Microservices OneScout Hive Metastore Data Center Core DB DataWario Firehose / Kafka / Rest API PM / Leader / Analyst Data Scientist / Analyst Engineer Engineer
  • 26. 10,000 Foot Architecture Overview Personal Analytics Cluster Datalake AWS Cloud Amazon Aurora Microservices OneScout Hive Metastore Data Center Core DB DataWario Firehose / Kafka / Rest API PM / Leader / Analyst Data Scientist / Analyst Engineer Engineer Section #3 Self-Service Analytics Section #1 Self-Service Ingestion Data Producer / Engineer PM / Leader / Analyst Section #2 Self-Service ETL Engineer
  • 28.
  • 30. Data Lake Bucket Types (diagram labels): Visibility Level: Regular, Restricted, Personal; each available as Raw and Processed
  • 31. Data Lake Access Roles Data Lake Account ImmobilienScout24 Data Lake Account AutoScout24 Regular DL Bucket Access Role Restricted DL Bucket Access Roles Personal DL Bucket Access Roles
  • 32. Data Lake Ingestion Goals Microservices Ingestion Goals Batch Support Streaming Support Rest API Scalability
  • 34. Ingestion Options Amazon Kinesis Data Firehose Kafka Connect Hadoop Rest API { } Simple Config Simple Config Configuration Free
  • 35. Kinesis Data Firehose Ingestion Mechanism
  • 36. Firehose Ingestion Architecture Data Lake Account Amazon Kinesis Data Firehose Producer to Firehose Role Producer Account Data Producer Firehose to Datalake Role Datalake Bucket
  • 37. Firehose Ingestion Architecture Data Lake Account Amazon Kinesis Data Firehose Producer to Firehose Role Producer Account Data Producer Firehose to Datalake Role Datalake Bucket STS Assume Role STS Assume Role
  • 38. Firehose Ingestion Architecture Data Lake Account Amazon Kinesis Data Firehose Producer to Firehose Role Producer Account Data Producer Firehose to Datalake Role Datalake Bucket STS Assume Role STS Assume Role Send Data
  • 39. Firehose Ingestion Architecture Data Lake Account Amazon Kinesis Data Firehose Producer to Firehose Role Producer Account Data Producer Firehose to Datalake Role Datalake Bucket STS Assume Role STS Assume Role Send Data Write Data
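The two-role handoff shown on the Firehose slides can be sketched as IAM policy documents. This is only a minimal sketch under stated assumptions, not Scout24's actual configuration: the account ID, region, stream name and function names are hypothetical. The action names (`sts:AssumeRole`, `firehose:PutRecord`, `firehose:PutRecordBatch`) and the `firehose.amazonaws.com` service principal are standard AWS identifiers.

```python
import json

# Hypothetical identifiers, for illustration only.
PRODUCER_ACCOUNT_ID = "111111111111"
DATALAKE_ACCOUNT_ID = "222222222222"
REGION = "eu-west-1"
STREAM_NAME = "example-ingestion-stream"


def producer_to_firehose_trust_policy(producer_account_id: str) -> dict:
    """Trust policy: principals in the producer account may assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{producer_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def producer_to_firehose_permissions(account_id: str, region: str, stream: str) -> dict:
    """Permissions: the assumed role may only send records to one delivery stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["firehose:PutRecord", "firehose:PutRecordBatch"],
                "Resource": f"arn:aws:firehose:{region}:{account_id}:deliverystream/{stream}",
            }
        ],
    }


def firehose_to_datalake_trust_policy() -> dict:
    """Trust policy: the Firehose service may assume the role that writes to the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "firehose.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(producer_to_firehose_trust_policy(PRODUCER_ACCOUNT_ID), indent=2))
```

Keeping both roles in the data lake account, as the slides show, means the producer never holds direct write permissions on the data lake bucket.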
  • 46. Scout24 Infinity Cluster Amazon ECS Infinity Service
  • 47. Scout24 Infinity Cluster Amazon ECS Infinity Service Simple Config
  • 48. Scout24 Infinity Cluster Amazon ECS Infinity Service Simple Config Central Logging to Elasticsearch Monitoring in Datadog Managed by Scout24 Cloud Platform Engineering
  • 49. Get the Slide Deck To Infinity and Beyond Handling Heterogenous Container Clusters in AWS Christine Trahe, Scout24 AG AWS Summit Berlin 2019
  • 50. Kafka Connect on Infinity Simple Config (Infinity) Amazon ECS Infinity Service
  • 51. Kafka Connect on Infinity
  • 52. Kafka Connect on Infinity Kafka Connect Service Infinity Service
  • 53. Kafka Connect on Infinity Simple Config (Kafka Connect) Kafka Connect Service Infinity Service
  • 54. Kafka Connect Deployment Write Data Read Data from Topic(s) Kafka Connect Cluster Amazon S3 Elasticsearch ActiveMQ Cassandra Kafka RDBMS …
  • 55. Kafka Connect Deployment Write Data Read Data from Topic(s) Kafka Connect Cluster Amazon S3
  • 56. Kafka Connect Deployment Write Data Read Data from Topic(s) Amazon S3
  • 57. Kafka Connect Deployment Write Data Read Data from Topic(s) Amazon S3 Kafka Connect Service
  • 58. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster Amazon ECS Infinity Cluster
  • 59. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster Kafka Connect Container Amazon ECS Kafka Connect Service (On Infinity)
  • 60. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster STS Assume Role Kafka Connect Container Amazon ECS Kafka Connect Service (On Infinity)
  • 61. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster STS Assume Role Kafka Connect Container Amazon ECS Kafka Connect Service (On Infinity) Read Topic
  • 62. Kafka Connect Deployment Kafka Account Infinity Account Datalake Account Datalake Bucket Kafka Connect Ingestion Role EC2 Instances Kafka Cluster STS Assume Role Kafka Connect Container Amazon ECS Kafka Connect Service (On Infinity) Read Topic Write Topic
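A "simple config" for a Kafka Connect deployment that writes topics to the data lake bucket might look like the sketch below. The deck does not name the connector in use; this assumes Confluent's S3 sink connector, and the connector name, topic, bucket and region are hypothetical. The `{"name": ..., "config": ...}` shape is the payload the Kafka Connect REST API accepts when creating a connector.

```python
import json


def s3_sink_connector_config(name, topics, bucket, region, flush_size=1000):
    """Build a connector payload for an S3 sink (assumed: Confluent S3 sink connector)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "topics": ",".join(topics),      # topics to read from
            "s3.bucket.name": bucket,        # target data lake bucket
            "s3.region": region,
            "flush.size": str(flush_size),   # records per S3 object
            "tasks.max": "1",
        },
    }


if __name__ == "__main__":
    payload = s3_sink_connector_config(
        name="example-s3-sink",              # hypothetical
        topics=["example-topic"],            # hypothetical
        bucket="example-datalake-bucket",    # hypothetical
        region="eu-west-1",
    )
    print(json.dumps(payload, indent=2))
```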
  • 63. Hadoop Rest API Ingestion Mechanism
  • 64. Hadoop Rest API – A Motivation Hadoop Rest API { } Unified Set of Ingestion Operations Performs Event Validation Adds Reception Timestamp Field Stable User-Facing Endpoints, Varying Backend
  • 65. Hadoop Rest API – Operation Overview Hadoop Rest API { } Add Event Batch Add Single Event PUT PUT Automatic Partitioning by Ingestion Date Persisting to Disk Initially, Periodically Flushing to Target Store Also Publishing to Kinesis Stream
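The operations listed on the two Hadoop Rest API slides (event validation, adding a reception timestamp, partitioning by ingestion date) can be illustrated roughly as follows. This is a sketch under assumptions, not the service's code: the required fields, the field name `reception_timestamp` and the partition path layout are all hypothetical.

```python
import json
from datetime import datetime, timezone

# Hypothetical validation rule: every event needs these fields.
REQUIRED_FIELDS = ("event_type", "payload")


def validate_event(event: dict) -> None:
    """Reject events that lack any required field."""
    missing = [field for field in REQUIRED_FIELDS if field not in event]
    if missing:
        raise ValueError(f"event is missing required fields: {missing}")


def enrich_event(event: dict, received_at: datetime) -> dict:
    """Return a copy of the event with a reception timestamp added."""
    enriched = dict(event)
    enriched["reception_timestamp"] = received_at.isoformat()
    return enriched


def partition_key(event_type: str, received_at: datetime) -> str:
    """Partition by ingestion date (hypothetical path layout)."""
    return f"{event_type}/ingestion_date={received_at:%Y-%m-%d}/"


def ingest_batch(events, received_at=None) -> dict:
    """Validate, enrich and group a batch of events by target partition."""
    received_at = received_at or datetime.now(timezone.utc)
    partitions = {}
    for event in events:
        validate_event(event)
        key = partition_key(event["event_type"], received_at)
        line = json.dumps(enrich_event(event, received_at))
        partitions.setdefault(key, []).append(line)
    return partitions
```

Grouping by partition first mirrors the "persist to disk initially, flush periodically" behaviour on the slide: each group can be written out as one object per flush.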
  • 66. Hadoop Rest API – Architecture Evolution (Phase 1) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake
  • 67. Hadoop Rest API – Architecture Evolution (Phase 1) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake Send Events
  • 68. Hadoop Rest API – Architecture Evolution (Phase 1) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake Send Events Ingest
  • 69. Hadoop Rest API – Architecture Evolution (Phase 2) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake Ingest Send Events Ingest
  • 70. Hadoop Rest API – Architecture Evolution (Phase 2) AWS Cloud Data Center Load Balancer HRA Service Generation 1 Microservice Datalake Ingest Send Events
  • 71. Hadoop Rest API – Architecture Evolution (Phase 3) AWS Cloud Data Center Microservice HRA Service Generation 2 Infinity Service Datalake
  • 72. Hadoop Rest API – Architecture Evolution (Phase 3) AWS Cloud Data Center Microservice HRA Service Generation 2 Infinity Service Datalake Send Events
  • 73. Hadoop Rest API – Architecture Evolution (Phase 3) AWS Cloud Data Center Microservice HRA Service Generation 2 Infinity Service Datalake Send Events Ingest
  • 75. Self Service ETL Requirements Data Landscape Manifesto Principle #5: Data consumers are responsible for the definition and visualization of metrics and for driving the implementation and maintenance of these metrics. → Data consumers get a lot of responsibility → The Data Platform must empower users to fulfill this responsibility → It must be easy to develop, test and maintain data processes Multi-Account setup: Data processes must run in the data consumer's account → No centrally run processing tool → An AWS managed service is preferred
  • 76. AWS Data Pipeline Team Account Extract Transform Load ETL runs in team account
  • 77. AWS Data Pipeline Team Account Extract Transform Load Teams run ETL in their accounts It's easy … or is it?
  • 78.
  • 80. DataWario 1. Client for AWS Data Pipeline 2. Transforms YAML into JSON format understood by Data Pipeline 3. Integrates with the Scout24 account setup and sets sensible configuration defaults 4. Augments Data Pipeline with additional features 5. Makes testing and debugging faster
  • 81. DataWario – Architecture Poll for Tasks Start Post Pipeline Definition Upload Artifacts Any Scout24 Account Data Center Client with DataWario Server Artifact Bucket EC2 Instance • Java application, distributed as a JAR • Shell wrapper (dw) for easy usage from the command line Deployment of a pipeline: 1. User issues dw deploy. 2. DataWario (a) uploads artifacts from the user's machine to S3, (b) generates the JSON definition from YAML, (c) creates the data pipeline. 3. AWS Data Pipeline runs the defined steps.
  • 83. Sensible defaults: EMR release: emr-5.21.0; EC2 instance type: m5.xlarge; IAM role: DataWarioResourceRole; Subnet ID: subnet-12345
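How defaults and the YAML-to-JSON translation could fit together is sketched below. DataWario's real input format is not shown in the deck, so the user configuration is a plain dictionary standing in for parsed YAML, and the function names are hypothetical. The default values come from the slide; the field names are intended to follow the `EmrCluster` object of AWS Data Pipeline and should be checked against its documentation.

```python
import json

# Default values taken from the slide; mapping them to these field names is an assumption.
DEFAULTS = {
    "releaseLabel": "emr-5.21.0",
    "masterInstanceType": "m5.xlarge",
    "resourceRole": "DataWarioResourceRole",
    "subnetId": "subnet-12345",
}


def emr_cluster_object(user_config: dict) -> dict:
    """Merge user settings over the defaults; user values win."""
    merged = {**DEFAULTS, **user_config}
    return {"id": "EmrCluster", "name": "EmrCluster", "type": "EmrCluster", **merged}


def pipeline_definition(user_config: dict) -> str:
    """Render a JSON pipeline definition with a single cluster object."""
    return json.dumps({"objects": [emr_cluster_object(user_config)]}, indent=2)


if __name__ == "__main__":
    # The user only overrides what differs from the defaults.
    print(pipeline_definition({"masterInstanceType": "m5.2xlarge"}))
```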
  • 84. Augmenting data pipeline Source table (ID | Make | Model | Price | Mileage): 1 | VW | Golf | 22000 | 15000; 2 | BMW | 320d | 35000 | 40000 "Built-In" Copy Activity output: 1,VW,Golf,22000,15000 2,BMW,320d,35000,40000 • CSV (or TSV) • No schema
  • 85. Augmenting data pipeline Source table (ID | Make | Model | Price | Mileage): 1 | VW | Golf | 22000 | 15000; 2 | BMW | 320d | 35000 | 40000 "Built-In" Copy Activity • CSV (or TSV) • No schema "Custom" Copy Activity + 1,VW,Golf,22000,15000 2,BMW,320d,35000,40000
  • 86. Speed up pipeline development Typical development cycle: dw deploy → starts → runs → fails; takes ca. 10 min + job runtime
  • 87. Speed up test and debug cycle Without test cluster: dw deploy → starts → runs → fails; typical development cycle takes ca. 10 min + job runtime With test cluster: dw test-cluster create, then dw deploy --test-cluster → uses test cluster → runs → fails; typical development cycle takes job runtime
  • 88. DataWario brought us from here … photo by Dirk van der Made (https://commons.wikimedia.org/wiki/File:DirkvdM_cloudforest-jungle.jpg), "DirkvdM cloudforest-jungle", https://creativecommons.org/licenses/by-sa/3.0/legalcode
  • 90. So, how do we get here?
  • 91. Paving the path further • We built a lot of pipelines, and so did our users • We see repetitive tasks emerge in these pipelines • Reimplementing solutions over and over again is waste • We don't like waste → Let's build reusable tools for common tasks
  • 92. Rob Pike; Brian W. Kernighan (October 1984). "Program Design in the UNIX Environment" Much of the power of the UNIX operating system comes from a style of program design that makes programs easy to use and, more important, easy to combine with other programs. This style has been called the use of software tools, and depends more on how the programs fit into the programming environment and how they can be used with other programs than on how they are designed internally. [...] This style was based on the use of tools: using programs separately or in combination to get a job done, rather than doing it by hand, by monolithic self-sufficient subsystems, or by special-purpose, one-time programs.
  • 93. Tools that do one thing, and do that well • Advantages: − Easy to configure, only a few settings − Easy to test, because there are fewer branches to cover − Easy to debug, output of each tool can be checked independently • Drawbacks: − More intermediate data written − Longer runtimes
  • 94. Event snapshotter • Typical questions: What was the headline of a listing on Feb 22, 2016? Was it published? How about March 1? • Input data: Change events for an entity (e.g. listing) − User changed headline of a listing, number of pictures − Listing got published/unpublished • Snapshotter 1. takes the snapshot from the day before 2. merges all change events from the day, keeping the last one per entity 3. writes the new snapshot for the day • Simple config (input and output paths, names of the ID and timestamp columns, partition schema) • No business logic, agnostic of type of entity
  • 95. Event Snapshotter Example {"id": 1, "changed": "2019-01-01T15:31:00.000+01:00", "headline": "nice appartment", "published": false} {"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice appartment", "published": false} {"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice appartment", "published": true} {"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice appartment", "published": true} January 1, 2019 (s3://snapshotBucket/snapshot_day=2019-01-01/): {"id": 1, "changed": "2019-01-01T18:31:00.000+01:00", "headline": "A very nice appartment", "published": false} January 2, 2019 (s3://snapshotBucket/snapshot_day=2019-01-02/): {"id": 1, "changed": "2019-01-02T06:31:00.000+01:00", "headline": "A very nice appartment", "published": true} {"id": 2, "changed": "2019-01-02T07:28:00.000+01:00", "headline": "Another nice appartment", "published": true}
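The snapshot logic from the example can be sketched in a few lines of plain Python. This is an illustration only, not the actual Snapshotter (which presumably runs as a Spark job): the function name `snapshot` and its parameters are hypothetical, and assigning an event to a day by the date part of its timestamp string is an assumption.

```python
from datetime import datetime

# The four change events from the slide.
EVENTS = [
    {"id": 1, "changed": "2019-01-01T15:31:00.000+01:00",
     "headline": "nice appartment", "published": False},
    {"id": 1, "changed": "2019-01-01T18:31:00.000+01:00",
     "headline": "A very nice appartment", "published": False},
    {"id": 1, "changed": "2019-01-02T06:31:00.000+01:00",
     "headline": "A very nice appartment", "published": True},
    {"id": 2, "changed": "2019-01-02T07:28:00.000+01:00",
     "headline": "Another nice appartment", "published": True},
]


def snapshot(previous, events, day, id_col="id", ts_col="changed"):
    """Build the snapshot for `day` (YYYY-MM-DD) from the previous day's
    snapshot plus that day's change events, keeping the last event per ID."""
    state = {row[id_col]: row for row in previous}
    # Assumption: an event belongs to the day given by its local timestamp.
    todays = [e for e in events if e[ts_col][:10] == day]
    for event in sorted(todays, key=lambda e: datetime.fromisoformat(e[ts_col])):
        state[event[id_col]] = event  # later events overwrite earlier ones
    return sorted(state.values(), key=lambda row: row[id_col])


if __name__ == "__main__":
    day1 = snapshot([], EVENTS, "2019-01-01")
    day2 = snapshot(day1, EVENTS, "2019-01-02")
    print(day1)
    print(day2)
```

Note that nothing in the function knows what a listing is; only the ID and timestamp column names are configurable, which matches the "no business logic" point on the previous slide.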
  • 96. Aggregator • Typical questions: − How many published listings did we have on Feb 1, 2018? − What's the daily average number of pageviews per car make? • Input data: − Interaction events of an entity (e.g. page viewed, email sent, number called) − Output of the Event Snapshotter • Simple config (input and output paths, timestamp column, grouping column, aggregation method and column)
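A grouping-and-aggregation step driven by a small config could look roughly like this. The sketch is plain Python for illustration; the function name, the supported method names ("count", "sum", "avg") and the sample rows are all made up and not taken from the actual Aggregator.

```python
from collections import defaultdict


def aggregate(rows, group_col, value_col=None, method="count"):
    """Group rows by `group_col` and aggregate `value_col` with `method`."""
    groups = defaultdict(list)
    for row in rows:
        # For a plain count no value column is needed.
        groups[row[group_col]].append(row[value_col] if value_col else 1)
    if method == "count":
        return {key: len(values) for key, values in groups.items()}
    if method == "sum":
        return {key: sum(values) for key, values in groups.items()}
    if method == "avg":
        return {key: sum(values) / len(values) for key, values in groups.items()}
    raise ValueError(f"unsupported method: {method}")


if __name__ == "__main__":
    # Hypothetical daily pageview totals per car make.
    daily = [
        {"make": "VW", "day": "2018-02-01", "pageviews": 10},
        {"make": "VW", "day": "2018-02-02", "pageviews": 20},
        {"make": "BMW", "day": "2018-02-01", "pageviews": 6},
    ]
    print(aggregate(daily, "make", "pageviews", "avg"))
```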
  • 97. Spark Utils • Collection of simple tools (usually one class per tool) • Can be used − standalone in a pipeline step − as a dependency of your code • Examples: − Create a Hive table on top of existing data − Convert data from one format to another (e.g. from JSON to Parquet for faster querying) − Flatten nested structures − Publish data quality metrics to CloudWatch − Materialize views
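To illustrate one of the listed tasks, here is a minimal sketch of flattening nested structures. It works on plain Python dicts rather than Spark DataFrames, so it only shows the idea; the function name and the underscore separator are assumptions, not the Spark Utils API.

```python
def flatten(record, parent_key="", sep="_"):
    """Turn nested dicts into one flat dict with joined column names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))  # recurse into nested struct
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    nested = {"id": 1, "vehicle": {"make": "VW", "engine": {"power": 110}}}
    print(flatten(nested))
```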
  • 98. Current State / Outlook • Data Pipeline with DataWario is widely used throughout the company • Challenges: − Data Pipeline is not getting much love from AWS (not deprecated, but not getting many new features) − Coordination between data pipelines is only rudimentary • Possible solutions: − AWS Step Functions in combination with AWS Glue − Introduction of a workflow tool (e.g. Airflow)
  • 101. What’s Ahead Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets Data Analysis for Various User Groups Provide a Timely and Accurate Update of the Metadata Layer
  • 102. What’s Ahead OneScout Hive Metastore Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets Data Analysis for Various User Groups Provide a Timely and Accurate Update of the Metadata Layer
  • 103. What’s Ahead Personal Analytics Cluster OneScout Hive Metastore Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets Data Analysis for Various User Groups Provide a Timely and Accurate Update of the Metadata Layer
  • 104. What’s Ahead Automatic Hive Partition Detection Personal Analytics Cluster OneScout Hive Metastore Unlock the Datalake for Scout24’s Toolset and Users with Different Skillsets Data Analysis for Various User Groups Provide a Timely and Accurate Update of the Metadata Layer
  • 105. OneScout Hive Metastore – A Schematic View Personal Analytics Cluster Hive Tables and Presto Views Datalake
  • 106. OneScout Hive Metastore – Recap of Ecosystem Product Account Product Account … Product Account Product Account Product Account … Product Account Data Lake Account ImmobilienScout24 Data Lake Account AutoScout24
  • 107. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting
  • 108. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMR Amazon EMR Amazon EMR Central Datalake Account
  • 109. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMR Amazon EMR Amazon EMR Central Datalake Account Connect Connect
  • 110. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMR Amazon EMR Amazon EMR Central Datalake Account Connect Connect Read Data Read Data
  • 111. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMR Amazon EMR Amazon EMR Central Datalake Account Connect Connect Read Data Read Data Amazon EMR Proxy Instance Amazon Aurora Proxy Instance Product Account Product Account … Metastore Account Amazon EMR
  • 112. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMR Amazon EMR Amazon EMR Central Datalake Account Connect Connect Read Data Read Data Amazon EMR Proxy Instance Amazon Aurora Proxy Instance Product Account Product Account … Metastore Account Amazon EMR Connect Connect
  • 113. The Scout24 Hive Metastore Proxy – A Motivation Goal: One Long-Running Metastore DB every EMR connects to Ideal Situation Solution for Multi-Account-Setting Amazon Aurora Amazon EMR Amazon EMR Amazon EMR Amazon EMR Central Datalake Account Connect Connect Read Data Read Data Amazon EMR Proxy Instance Amazon Aurora Proxy Instance Product Account Product Account … Metastore Account Amazon EMR Connect Connect Forward Request Forward Request
  • 114. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake
  • 115. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table
  • 116. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion
  • 117. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion
  • 118. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion Table Access
  • 119. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion Table Access
  • 120. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion Table Access Automatic Partition Detection
  • 121. Automated Partition Detection – A Motivation Personal Analytics Cluster Hive Tables and Presto Views Datalake Partitioned Table Data Ingestion Table Access Automatic Partition Detection
  • 122. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition Detector Partition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table
  • 123. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition Detector Partition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Trigger Publish
  • 124. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition Detector Partition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table
  • 125. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition Detector Partition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events
  • 126. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition Detector Partition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs
  • 127. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition Detector Partition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs Apply Rule
  • 128. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition Detector Partition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs Apply Rule Send to Partition Detector
  • 129. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition Detector Partition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs Apply Rule Send to Partition Detector Get Bucket Notification
  • 130. Partition Detection Architecture Metastore Account VPC (in eu-west-1) Subnet Datalake Account Amazon CloudWatch AWS CloudTrail Datalake Aggregate Events by Key Filter Bucket Notifications Partition Detector Partition Detection Events Topic Amazon Aurora Metadata Item Creator Create Table Events Topic Table Trigger Partition Info Table Data Events Transfer to CloudWatch Logs Apply Rule Send to Partition Detector Get Bucket Notification Add Partition Get Partition Info
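The core step of a partition detector, turning the key of a newly written S3 object into a partition to register, might be sketched as below. The slides do not show how partitions are actually added to the metastore, so the Hive `ALTER TABLE ... ADD PARTITION` statement, the function names and the table name `listings` are assumptions for illustration.

```python
def partition_spec(key):
    """Extract Hive-style partition columns (name=value path segments)
    from an S3 object key."""
    spec = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            spec[name] = value
    return spec


def add_partition_sql(table, bucket, key):
    """Return a Hive DDL statement registering the partition, or None if the
    key does not contain any partition segment."""
    spec = partition_spec(key)
    if not spec:
        return None
    prefix = key.rsplit("/", 1)[0]
    cols = ", ".join(f"{name}='{value}'" for name, value in spec.items())
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({cols}) "
            f"LOCATION 's3://{bucket}/{prefix}/'")


if __name__ == "__main__":
    print(add_partition_sql(
        "listings", "snapshotBucket",
        "snapshots/snapshot_day=2019-01-02/part-0000.json"))
```

Using `IF NOT EXISTS` keeps the statement idempotent, which matters when many object notifications arrive for the same partition.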
  • 132. The Personal Analytics Cluster – An Overview Amazon EMR Personal Analytics Cluster AWS CloudFormation
  • 133. The Personal Analytics Cluster – An Overview
  • 134. The Personal Analytics Cluster – An Overview Amazon EMR Personal Analytics Cluster AWS CloudFormation
  • 135. The Personal Analytics Cluster – An Overview Amazon EMR Personal Analytics Cluster Simple Config AWS CloudFormation
  • 136. The Personal Analytics Cluster – An Overview Amazon EMR Personal Analytics Cluster Simple Config Easy Access via Web Interface Zeppelin and Jupyter Notebook Restore OneClick Deployment Managed Scaling and Shutdown Support for Pre-baked AMIs and Configs AWS CloudFormation
  • 137. The Personal Analytics Cluster – Configuration Setup • Cluster Name (default is usually username) • AWS Account Name • Market segment • Use Case • Shutdown time • Instance Type • Bid Price • Key Pair • EBS Volume Size • Tear Down Existing Cluster (optional)
  • 138. The Personal Analytics Cluster – Configuration Setup $ ./deploy-cluster --cluster-name <> --account-name <> --market-segment <> --usecase <> --shutdown-time <> --instance-type <> --bid-price <> --key-pair <> --ebs-volume-size <> --teardown-existing <>
  • 140. The Personal Analytics Cluster – Data Lake Access Regular Restricted Personal Restricted Regular Personal Raw Processed Visibility Level Data Lake Account ImmobilienScout24 Data Lake Account AutoScout24 Regular DL Bucket Access Role Restricted DL Bucket Access Roles Personal DL Bucket Access Roles
  • 141. The Personal Analytics Cluster – Data Lake Access • Managed with an Amazon EMR security configuration for EMRFS • Integrates internally with EMR applications (Spark, Presto, etc.) • Easy to define • Maps a defined IAM role to the S3 bucket being accessed
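A role-to-bucket mapping of this kind can be expressed as an EMR security configuration with EMRFS role mappings. The sketch below builds such a document in Python; the helper name, the account ID and the role names are hypothetical placeholders, and only the bucket names come from the slides.

```python
import json


def emrfs_security_configuration(bucket_roles):
    """Build an EMR security configuration that maps S3 bucket prefixes to
    the IAM roles EMRFS should assume when accessing them."""
    mappings = [
        {
            "Role": role_arn,
            "IdentifierType": "Prefix",
            "Identifiers": [f"s3://{bucket}/"],
        }
        for bucket, role_arn in bucket_roles.items()
    ]
    return {"AuthorizationConfiguration":
            {"EmrFsConfiguration": {"RoleMappings": mappings}}}


if __name__ == "__main__":
    # Placeholder ARNs; real role names and account IDs will differ.
    config = emrfs_security_configuration({
        "our-regular-datalake-bucket":
            "arn:aws:iam::111111111111:role/NonRestrictedRole",
        "our-restricted-datalake-bucket":
            "arn:aws:iam::111111111111:role/RestrictedRole",
    })
    print(json.dumps(config, indent=2))
```

With a mapping like this in place, a plain `spark.read.parquet("s3://our-regular-datalake-bucket/...")` call needs no credentials in user code, because EMRFS picks the role based on the bucket prefix.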
  • 142. The Personal Analytics Cluster – Data Lake Access Scenario One: Non-restricted Data Lake Access
  • 143. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud emrfs-site.xml AWS Cloud Regular Restricted Personal Non-restricted Role Restricted Role Personal Role my_data = spark.read.parquet("s3://our-regular-datalake-bucket/some-data") Data Lake Account
  • 144. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud emrfs-site.xml my_data = spark.read.parquet("s3://our-regular-datalake-bucket/some-data") AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping Non-restricted Role Restricted Role Personal Role Data Lake Account
  • 145. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud Non-restricted Role Restricted Role Personal Role emrfs-site.xml my_data = spark.read.parquet("s3://our-regular-datalake-bucket/some-data") AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Data Lake Account
  • 146. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud Non-restricted Role Restricted Role Personal Role emrfs-site.xml my_data = spark.read.parquet("s3://our-regular-datalake-bucket/some-data") AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Data Lake Account
  • 147. The Personal Analytics Cluster – Data Lake Access Scenario Two: Restricted/Personal Data Lake Access
  • 148. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud emrfs-site.xml my_data = spark.read.parquet("s3://our-restricted-datalake-bucket/some-data") AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping Restricted Role Non-restricted Role Personal Role Data Lake Account
  • 149. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment Account AWS Cloud emrfs-site.xml my_data = spark.read.parquet("s3://our-restricted-datalake-bucket/some-data") AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Restricted Role Non-restricted Role Personal Role Data Lake Account
  • 150. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment A Account AWS Cloud Restricted Role Non-restricted Role Personal Role emrfs-site.xml my_data = spark.read.parquet("s3://our-restricted-datalake-bucket/some-data") AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Data Lake Account
  • 151. The Personal Analytics Cluster – Data Lake Access Amazon EMR PAC Segment B Account AWS Cloud Restricted Role Non-restricted Role Personal Role emrfs-site.xml my_data = spark.read.parquet("s3://our-restricted-datalake-bucket/some-data") AWS Cloud Regular Restricted Personal EMR-FS Role-to-Bucket Mapping STS Assume Role Data Lake Account
  • 152. Data Platform SQL Access
  • 153. Our Journey to Presto
  • 154. Our Journey to Presto On EMR Cross Account Support (OneScout Hive Metastore) Leverages Datalake Access Roles (EMRFS) Scheduled Scaling Configurations Fits our GDPR Concept (multiple isolated Clusters)
  • 155. Our Journey to Presto Average Presto Queries per Day 360 Queries per Day Number of Queries
  • 156. We hope to throw out most of the custom components we built.
  • 157. Our data warehouse was a bottleneck
  • 158. Our data platform holds nothing back
  • 159. Christian Dietze Olalekan Elesin Raffael Dzikowski Senior Data Engineer Data Engineer Senior Data Engineer Scout24 Group Scout24 Group Scout24 Group Thank you for your attention!