Clemens Valiente gives a presentation on trivago's use of Apache Kafka for data pipelines. He describes trivago's past data pipeline from 2010-2015, which had limitations. Their new pipeline uses Kafka for scalable ingestion and storage of large amounts of hotel price and other data. The pipeline reliably processes over 10 billion messages per day and supports many business uses. Ongoing work includes breaking up monolithic systems and improving data quality.
3. 3
Collecting price information for BI
As a hotel price comparison engine, our most valuable information is hotel
prices. They are not only shown to our visitors to support their hotel booking
decision, but also stored and later analyzed by Business Intelligence.
With over one million hotels and all major booking websites connected to our
system, we have one of the most complete sources of information on hotel price
development and trends.
6. 6
The past: Data pipeline 2010 – 2015
[Diagram labels: Java Software Engineering, BI Warehouse]
10. 10
The past: Data pipeline 2010 – 2015
Facts & Figures
Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years
Restrictions
- Only single night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- “Insert ignore”: The first price per key wins
Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline in early 2015, on average around 100 million prices per day were written to BI
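The "insert ignore" restriction above can be illustrated with a minimal sketch. It assumes the key is the combination of hotel, website, arrival date and logging day named in the restrictions. The function name, field names and the in-memory dictionary are hypothetical stand-ins for illustration, not the original warehouse implementation.

```python
from datetime import date


def insert_ignore(store, hotel_id, website, arrival_date, log_day, price):
    """Store a price only if the key is still free; return True if it was stored.

    The key (hotel, website, arrival date, logging day) is an assumption based
    on the restriction "one price per hotel, website and arrival date per day".
    """
    key = (hotel_id, website, arrival_date, log_day)
    if key in store:
        # First price per key wins: later prices for the same key are dropped.
        return False
    store[key] = price
    return True


# Hypothetical example values.
store = {}
log_day = date(2015, 1, 15)
arrival = date(2015, 3, 1)

first_stored = insert_ignore(store, 42, "site-a", arrival, log_day, 120.0)
second_stored = insert_ignore(store, 42, "site-a", arrival, log_day, 95.0)
```

In this sketch the second, cheaper price is silently discarded, which shows why the old pipeline could miss price changes that happened within the same day.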
16. 16
Refactoring the pipeline: Requirements
• Scales with an arbitrary amount of data (future-proof)
• Reliable and resilient
• Low performance impact on Java backend
• Long-term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
  • More prices
    • Length of stay, room type, breakfast info, room category, domain
  • With more information
    • Net & gross price, city tax, resort fee, affiliate fee, VAT
21. 21
Present data pipeline 2017 – results after two years in production
• Very reliable, barely any downtime or service interruptions of the system
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• Stakeholders are very happy
  • Faster results
  • Better quality of results due to more data
  • More detailed results
  • => Shorter research phase, more and better stories
  • => Fewer requests & less workload for BI
24. 24
Present data pipeline 2017 – facts & figures
Kafka Cluster specifications
- Cluster of 5 machines in each data centre for logs
- An additional cluster of two machines in Düsseldorf for aggregation/stream processing
Data Size (price log)
- Over 4 trillion messages collected so far
- 10 billion messages/day
- Over a hundred topics
Camus
- MapReduce application that writes prices to HDFS
- 15 mappers running in parallel
- Runs pretty much continuously, in 10-minute intervals
- To be replaced by Gobblin/Kafka Connect
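To put the data size figures in perspective, the short calculation below derives a few averages from the numbers on the slides. It is only back-of-the-envelope arithmetic: it assumes a perfectly even load over the day, and the comparison with the old pipeline sets Kafka messages against stored prices, which are not exactly the same unit.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slides.
messages_per_day = 10_000_000_000        # present pipeline, price log
total_messages = 4_000_000_000_000       # collected so far
old_prices_per_day = 100_000_000         # old pipeline, early 2015

# Average ingestion rate if the load were spread evenly over the day.
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY

# Rough ratio between the new daily message volume and the old daily price volume.
volume_ratio = messages_per_day / old_prices_per_day

# Number of days the collected total would take at the current daily rate.
days_at_current_rate = total_messages / messages_per_day

print(f"{avg_messages_per_second:,.0f} messages/second on average")
print(f"{volume_ratio:.0f}x the old daily volume")
print(f"{days_at_current_rate:.0f} days at the current rate")
```

Real traffic is unlikely to be flat, so peak rates can be expected to sit well above the daily average.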
27. 27
Present data pipeline 2017 – use cases & status quo
Uses for price information
- Monitoring price parity in the hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors
Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation, etc.
- Every Euro of revenue at some point was a message in Kafka
Status quo
- Our entire BI business logic runs on and through the Kafka–Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
30. 30
Key challenges and learnings
● Settle on a common message format (Avro/Protobuf, not CSV or JSON)
● A common message envelope is helpful (e.g. header with timestamp and sender)
● For stream processing, repeat your key in your message value
● Monitor your consumer offsets with an audit log, especially across data centres
● Turn off auto creation of topics, but have a process in place for topic creation
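The envelope and repeated-key learnings above can be sketched as follows. This is a minimal illustration, not trivago's actual schema: the class names, field names, key layout and sender name are all hypothetical, and plain dataclasses stand in for what would be an Avro or Protobuf schema in practice.

```python
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Header:
    """Common envelope header shared by every message type (hypothetical fields)."""
    timestamp_ms: int
    sender: str


@dataclass(frozen=True)
class PriceMessage:
    """Message value; in production this would be an Avro or Protobuf schema."""
    header: Header
    key: str           # copy of the message key, repeated inside the value
    price_cents: int


def build_record(hotel_id, website, price_cents, sender, timestamp_ms=None):
    """Return (key, value) for one price, with the key repeated in the value."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    key = f"{hotel_id}:{website}"  # hypothetical key layout
    value = PriceMessage(Header(timestamp_ms, sender), key, price_cents)
    return key, value


key, value = build_record(
    42, "site-a", 12000, sender="java-backend", timestamp_ms=1_500_000_000_000
)
record_as_dict = asdict(value)
```

Carrying the key inside the value means a downstream stream processor that only sees values, or that re-keys the stream, can still recover the original key. For the last learning, Kafka brokers expose the `auto.create.topics.enable` setting, which can be set to `false` so that topics only come into existence through an explicit creation process.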
32.
● Thanks to Jan Filipiak for his brainpower behind most projects
● Additional resources:
  ● https://github.com/trivago/gollum An n:m message multiplexer written in Go
  ● https://github.com/trivago/triava TriavaCache, JSR107 compliant cache