Apache Kafka at trivago
2017-01-25, Munich, Germany
Clemens Valiente
Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Senior Data Engineer
trivago Düsseldorf
Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years
Collecting price information for BI
As a hotel price comparison engine, our most valuable information is hotel prices. They are not only shown to our visitors to support their hotel booking decision, but also stored and later analyzed by Business Intelligence.
With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.
The past: Data pipeline 2010 – 2015
Java Software
Engineering
BI
Warehouse
The past: Data pipeline 2010 – 2015
Facts & Figures
Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years
Restrictions
- Only single night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- “Insert ignore”: The first price per key wins
Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline, in early 2015, around 100 million prices per day on average were written to BI
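The "insert ignore" restriction can be illustrated with a minimal Python sketch. It assumes the key is (hotel, website, arrival date, collection day), as the restrictions above suggest; the function name, field names and sample values are hypothetical and not taken from the original pipeline.

```python
from datetime import date


def insert_ignore(store, hotel_id, website, arrival_date, collected_on, price):
    """Store the price only if the key is still free; return True if it was stored."""
    # Assumed key: one price per hotel, website and arrival date per day.
    key = (hotel_id, website, arrival_date, collected_on)
    if key in store:
        return False  # a price for this key already exists: the first one wins
    store[key] = price
    return True


store = {}
day = date(2015, 1, 15)      # hypothetical collection day
arrival = date(2015, 3, 1)   # hypothetical arrival date

first = insert_ignore(store, 42, "example-booking-site", arrival, day, 120.0)
second = insert_ignore(store, 42, "example-booking-site", arrival, day, 95.0)
```

Under these assumptions the second, lower price seen on the same day is silently dropped, so any price change within a day is lost.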
Refactoring the pipeline: Requirements
• Scales with an arbitrary amount of data (future-proof)
• Reliable and resilient
• Low performance impact on the Java backend
• Long-term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
  • More prices
    • Length of stay, room type, breakfast info, room category, domain
  • With more information
    • Net & gross price, city tax, resort fee, affiliate fee, VAT
Present data pipeline 2017 – ingestion
San Francisco
Düsseldorf
Hong Kong
Present data pipeline 2017 – processing
Camus
Present data pipeline 2017 – results after two years in production
• Very reliable, barely any downtime or service interruptions of the system
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• Stakeholders are very happy
  • Faster results
  • Better quality of results due to more data
  • More detailed results
  • => Shorter research phase, more and better stories
  • => Fewer requests and less workload for BI
Present data pipeline 2017 – facts & figures
Kafka Cluster specifications
- Cluster of 5 machines in each data centre for logs
- An additional cluster of two machines in Düsseldorf for aggregation/stream processing
Data Size (price log)
- Over 4 trillion messages collected so far
- 10 billion messages/day
- Over a hundred topics
Camus
- MapReduce application that writes prices to HDFS
- 15 mappers running in parallel
- Runs pretty much continuously, in 10-minute intervals
- To be replaced by Gobblin/Kafka Connect
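To put these figures in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes the 10 billion messages are spread evenly over the day and that all of them pass through Camus, neither of which the slides state; real traffic is likely to peak and dip.

```python
# Figures taken from the slide.
MESSAGES_PER_DAY = 10_000_000_000
CAMUS_INTERVAL_MINUTES = 10
CAMUS_MAPPERS = 15

SECONDS_PER_DAY = 24 * 60 * 60
runs_per_day = (24 * 60) // CAMUS_INTERVAL_MINUTES

# Averages under the even-distribution assumption (integer division, rounded down).
avg_per_second = MESSAGES_PER_DAY // SECONDS_PER_DAY
avg_per_run = MESSAGES_PER_DAY // runs_per_day
avg_per_mapper = avg_per_run // CAMUS_MAPPERS

print(f"{avg_per_second:,} messages per second on average")
print(f"{avg_per_run:,} messages per 10-minute Camus run")
print(f"{avg_per_mapper:,} messages per mapper per run")
```

Under those assumptions this works out to roughly 115,000 messages per second, about 69 million messages per Camus run, and about 4.6 million messages per mapper per run.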
Present data pipeline 2017 – use cases & status quo
Uses for price information
- Monitoring price parity in the hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivery of price alerts to website visitors
Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.
- Every Euro of revenue was at some point a message in Kafka
Status quo
- Our entire BI business logic runs on and through the Kafka–Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
Ongoing Projects: Breaking up the Monolith
Düsseldorf
Leipzig
Palma
Key challenges and learnings
● Settle on a common message format (Avro/Protobuf, not CSV or JSON)
● A common message envelope is helpful (e.g. a header with timestamp and sender)
● For stream processing, repeat your key in your message value
● Monitor your consumer offsets with an audit log, especially across data centres
● Turn off auto-creation of topics, but have a process in place for topic creation
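The envelope and key-in-value learnings can be sketched in a few lines of Python. This is an illustration only: the class names, field names, key layout and sender name are hypothetical, and JSON serves as the encoding purely to keep the sketch dependency-free, whereas the recommendation above is a schema-based format such as Avro or Protobuf.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Envelope:
    """Common header shared by all messages (hypothetical field names)."""
    timestamp_ms: int
    sender: str


@dataclass(frozen=True)
class PriceMessage:
    header: Envelope
    hotel_id: int   # key field, repeated in the value
    website: str    # key field, repeated in the value
    price: float

    def key(self) -> bytes:
        # Hypothetical key layout: "<hotel_id>:<website>"
        return f"{self.hotel_id}:{self.website}".encode("utf-8")

    def value(self) -> bytes:
        # The value carries the header and the key fields, so a consumer
        # that only sees the value can still recover them.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


msg = PriceMessage(
    header=Envelope(timestamp_ms=1485302400000, sender="price-logger-dus"),
    hotel_id=42,
    website="example-booking-site",
    price=120.0,
)
decoded = json.loads(msg.value())
```

For the last learning, automatic topic creation on Kafka brokers is controlled by the `auto.create.topics.enable` broker setting.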
Thank you!
Questions and comments?
● Thanks to Jan Filipiak for his brainpower behind most projects
● Additional resources:
  ● https://github.com/trivago/gollum (an n:m message multiplexer written in Go)
  ● https://github.com/trivago/triava (TriavaCache, a JSR107-compliant cache)

 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 

Apache Kafka at trivago powers data pipelines

  • 1. Apache Kafka at trivago 2017-01-25, Munich, Germany Clemens Valiente
  • 2. Email: clemens.valiente@trivago.com de.linkedin.com/in/clemensvaliente Senior Data Engineer trivago Düsseldorf Originally a mathematician Studied at Uni Erlangen At trivago for 5 years Clemens Valiente
  • 3. Collecting price information for BI. As a hotel price comparison engine, our most valuable information is hotel prices. They are not only shown to our visitors to support their hotel booking decision, but also stored and later analyzed by Business Intelligence. With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.
  • 4–7. The past: Data pipeline 2010 – 2015 [diagram, built up over four slides; labels: Java Software Engineering, BI Warehouse]
  • 8–10. The past: Data pipeline 2010 – 2015, Facts & Figures
      Price dimensions
      - Around one million hotels
      - 250 booking websites
      - Travellers search for up to 180 days in advance
      - Data collected over five years
      Restrictions
      - Only single-night stays
      - Only prices from European visitors
      - Prices cached up to 30 minutes
      - One price per hotel, website and arrival date per day
      - “Insert ignore”: The first price per key wins
      Size of data
      - We collected a total of 56 billion prices in those five years
      - Towards the end of this pipeline, in early 2015, around 100 million prices per day on average were written to BI
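The "insert ignore" restriction above can be illustrated with a small sketch. It assumes a key made of hotel, website, arrival date and collection day, which matches the "one price per hotel, website and arrival date per day" rule. The names `PriceKey` and `PriceStore` and all sample values are hypothetical and not taken from trivago's code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PriceKey:
    """Hypothetical key: one price per hotel, website and arrival date per day."""
    hotel_id: int
    website_id: int
    arrival_date: date
    collected_on: date


class PriceStore:
    """In-memory stand-in for a table written with "insert ignore" semantics."""

    def __init__(self):
        self._prices = {}  # maps PriceKey -> price

    def insert_ignore(self, key, price):
        """Store the price only if the key is new; return True if it was stored."""
        if key in self._prices:
            return False  # a price for this key already exists: the first one wins
        self._prices[key] = price
        return True

    def get(self, key):
        return self._prices[key]


store = PriceStore()
key = PriceKey(hotel_id=1, website_id=7,
               arrival_date=date(2015, 3, 1),
               collected_on=date(2015, 1, 10))
first = store.insert_ignore(key, 99.0)   # stored
second = store.insert_ignore(key, 89.0)  # ignored, although it arrived later
```

With this rule, any price change later on the same day for the same key is lost, which is one way to read why the slide lists it under restrictions.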
  • 11–15. The past: Data pipeline 2010 – 2015 [same diagram, five further slides; labels: Java Software Engineering, BI Warehouse]
  • 16. Refactoring the pipeline: Requirements
      • Scales with an arbitrary amount of data (future proof)
      • Reliable and resilient
      • Low performance impact on Java backend
      • Long-term storage of raw input data
      • Fast processing of filtered and aggregated data
      • Open source
      • We want to log everything:
          • more prices: length of stay, room type, breakfast info, room category, domain
          • with more information: net & gross price, city tax, resort fee, affiliate fee, VAT
  • 17–19. Present data pipeline 2017 – ingestion [diagram; labels: Düsseldorf, San Francisco, Hong Kong]
  • 20. Present data pipeline 2017 – processing [diagram; label: Camus]
  • 21. Present data pipeline 2017 – results after two years in production
      • Very reliable, barely any downtime or service interruptions of the system
      • Java team is very happy: less load on their system
      • BI team is very happy: more data, more resources to process it
      • Stakeholders are very happy
          • Faster results
          • Better quality of results due to more data
          • More detailed results
          • => Shorter research phase, more and better stories
          • => Fewer requests and less workload for BI
  • 22–24. Present data pipeline 2017 – facts & figures
      Kafka cluster specifications
      - Cluster of 5 machines in each data centre for logs
      - An additional cluster of two machines in Düsseldorf for aggregation/stream processing
      Data size (price log)
      - Over 4 trillion messages collected so far
      - 10 billion messages/day
      - Over a hundred topics
      Camus
      - MapReduce application that writes prices to HDFS
      - 15 mappers running in parallel
      - Pretty much continuously in 10-minute intervals
      - To be replaced by Gobblin/Kafka Connect
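To put the price-log figures in perspective, the following back-of-the-envelope sketch converts the daily volume into an average rate and compares it with the roughly 100 million prices per day written to BI in early 2015. The comparison is only indicative: the old figure counts deduplicated prices written to the warehouse, the new one counts Kafka messages, and an average says nothing about peak load. The variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day_2017 = 10_000_000_000  # "10 billion messages/day"
prices_per_day_2015 = 100_000_000       # "around 100 million prices per day" in early 2015

# Average rate, assuming traffic were spread evenly over the day.
avg_messages_per_second = messages_per_day_2017 / SECONDS_PER_DAY

# Ratio of the two daily volumes (not a like-for-like comparison, see above).
growth_factor = messages_per_day_2017 / prices_per_day_2015

print(f"average rate: {avg_messages_per_second:,.0f} messages/second")
print(f"daily volume vs. early 2015: {growth_factor:.0f}x")
```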
  • 25–27. Present data pipeline 2017 – use cases & status quo
      Uses for price information
      - Monitoring price parity in hotel market
      - Anomaly and fraud detection
      - Price feed for online marketing
      - Display of price development and delivering price alerts to website visitors
      Other data sources and usage
      - Clicklog information from our website and mobile app
      - Used for marketing performance analysis, product tests, invoice generation etc.
      - Every Euro of revenue at some point was a message in Kafka
      Status quo
      - Our entire BI business logic runs on and through the Kafka–Hadoop pipeline
      - Almost all departments rely on data, insights and metrics delivered by Hadoop
      - Most of the company could not do their job without Hadoop data
  • 30. Key challenges and learnings
      ● Settle on a common message format (Avro/Protobuf, not CSV or JSON)
      ● A common message envelope is helpful (e.g. header with timestamp and sender)
      ● For stream processing, repeat your key in your message value
      ● Monitor your consumer offsets with an audit log, especially across data centres
      ● Turn off auto creation of topics, but have a process in place for topic creation
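Two of the learnings above, the common envelope and repeating the key in the value, can be sketched as plain data structures. This is a minimal illustration, not trivago's actual schema: the class names, field names, key layout and the sender string are all assumptions, and in practice the structure would be defined as an Avro or Protobuf schema rather than as Python classes.

```python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Header:
    """Common envelope header shared by all message types (hypothetical fields)."""
    timestamp_ms: int
    sender: str


@dataclass(frozen=True)
class PriceEvent:
    """Payload; hotel_id and website_id deliberately repeat the message key."""
    hotel_id: int
    website_id: int
    price: float


@dataclass(frozen=True)
class Envelope:
    header: Header
    payload: PriceEvent


def build_message(hotel_id, website_id, price, sender, now_ms=None):
    """Return a (key, value) pair as it might be handed to a producer."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    key = f"{hotel_id}:{website_id}"
    value = Envelope(Header(now_ms, sender),
                     PriceEvent(hotel_id, website_id, price))
    return key, value


def key_from_value(envelope):
    """Rebuild the key from the value alone, e.g. when a step only sees values."""
    payload = envelope.payload
    return f"{payload.hotel_id}:{payload.website_id}"


key, envelope = build_message(1, 7, 99.0, sender="price-service-example",
                              now_ms=1485302400000)
```

Because the value carries everything needed to rebuild the key, a downstream step that re-keys or drops keys does not lose the information needed to group records again.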
  • 31. Thank you! Questions and comments?
  • 32. Thanks to Jan Filipiak for his brainpower behind most projects
      Additional resources:
      ● https://github.com/trivago/gollum (an n:m message multiplexer written in Go)
      ● https://github.com/trivago/triava (TriavaCache, a JSR107-compliant cache)