Apache Kafka at trivago powers data pipelines

2017-01-25, Munich, Germany
Clemens Valiente

Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Senior Data Engineer
trivago Düsseldorf
Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years

Collecting price information for BI
As a hotel price comparison engine, our most valuable information is hotel prices.
They are not only shown to our visitors to support their hotel booking decision, but also stored and later analyzed by Business Intelligence.
With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.

The past: Data pipeline 2010 – 2015

[Diagram: Java Software Engineering, BI Warehouse]

Facts & Figures
Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years

Restrictions
- Only single night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- “Insert ignore”: The first price per key wins
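
The "insert ignore" rule above can be sketched in a few lines. This is only an illustration of the semantics, not the original implementation: the key fields follow the slide (hotel, website, arrival date, day), while the names `PriceKey`, `insert_ignore`, `hotel_id` and `log_date` and the sample values are hypothetical.

```python
from datetime import date
from typing import Dict, NamedTuple


class PriceKey(NamedTuple):
    hotel_id: int        # hypothetical hotel identifier
    website: str         # booking website the price came from
    arrival_date: date   # arrival date the traveller searched for
    log_date: date       # day on which the price was observed


def insert_ignore(store: Dict[PriceKey, float], key: PriceKey, price: float) -> bool:
    """Store the price only if the key is new; return True if it was stored."""
    if key in store:
        return False  # a price for this key already exists, so this one is dropped
    store[key] = price
    return True


store: Dict[PriceKey, float] = {}
key = PriceKey(42, "example-booking-site", date(2015, 3, 1), date(2015, 1, 10))
first = insert_ignore(store, key, 89.0)   # first price of the day is kept
second = insert_ignore(store, key, 95.0)  # same key later that day is ignored
```

With this rule, any price change later on the same day for the same hotel, website and arrival date is lost, which is one reason the old pipeline kept less information than was available.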

Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline in early 2015 on average around 100 million prices per day were written to BI

Refactoring the pipeline: Requirements
• Scales with an arbitrary amount of data (future proof)
• Reliable and resilient
• Low performance impact on Java backend
• Long term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
    • more prices: length of stay, room type, breakfast info, room category, domain
    • with more information: net & gross price, city tax, resort fee, affiliate fee, VAT
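
To make the "log everything" requirement concrete, here is a minimal sketch of what one logged price record could look like. The attributes come from the list above plus the hotel, website and arrival date mentioned earlier in the deck; all field names, types and default values are assumptions for illustration and not the actual trivago schema.

```python
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PriceLogRecord:
    # Field names and types are hypothetical; the slide only lists the attributes.
    hotel_id: int
    website: str
    arrival_date: date
    length_of_stay: int
    room_type: str
    room_category: str
    breakfast_included: Optional[bool]  # None when the website gives no breakfast info
    domain: str
    net_price: float
    gross_price: float
    city_tax: float = 0.0
    resort_fee: float = 0.0
    affiliate_fee: float = 0.0
    vat: float = 0.0


record = PriceLogRecord(
    hotel_id=42, website="example-booking-site", arrival_date=date(2017, 3, 1),
    length_of_stay=2, room_type="double", room_category="standard",
    breakfast_included=True, domain="example.com",
    net_price=100.0, gross_price=119.0, vat=19.0,
)
fields = asdict(record)  # plain dict, e.g. as input for a serializer
```

Compared with the old pipeline's single price per key, a record like this keeps the price components separately, so they can be filtered and aggregated later.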

Present data pipeline 2017 – ingestion
[Diagram: locations San Francisco, Düsseldorf, Hong Kong]

Present data pipeline 2017 – processing
Camus

Present data pipeline 2017 – results after two years in production
• Very reliable, barely any downtime or service interruptions of the system
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• Stakeholders are very happy
    • Faster results
    • Better quality of results due to more data
    • More detailed results
    • => Shorter research phase, more and better stories
    • => Fewer requests & less workload for BI

Present data pipeline 2017 – facts & figures
Kafka Cluster specifications
- Cluster of 5 machines in each data centre for logs
- An additional cluster of two machines in Düsseldorf for aggregation/stream processing

Data Size (price log)
- Over 4 trillion messages collected so far
- 10 billion messages/day
- Over a hundred topics

Camus
- MapReduce application that writes prices to HDFS
- 15 mappers running in parallel
- Runs pretty much continuously, in 10 minute intervals
- To be replaced by Gobblin/Kafka Connect
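
A quick back-of-the-envelope calculation puts these figures into perspective. It is a rough sketch under two assumptions that the slides do not state: that the load is spread evenly over the day, and that all 10 billion daily messages pass through Camus. The comparison with the old pipeline sets messages against stored prices, so treat the growth factor as an order of magnitude only.

```python
MESSAGES_PER_DAY = 10_000_000_000   # from the slide: 10 billion messages/day
OLD_PRICES_PER_DAY = 100_000_000    # old pipeline, early 2015: ~100 million prices/day
RUN_INTERVAL_MINUTES = 10           # Camus runs roughly every 10 minutes
MAPPERS = 15                        # 15 mappers running in parallel

SECONDS_PER_DAY = 24 * 60 * 60
runs_per_day = 24 * 60 // RUN_INTERVAL_MINUTES

# Averages, assuming a perfectly even load over the day (an assumption).
messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
messages_per_run = MESSAGES_PER_DAY / runs_per_day
messages_per_mapper = messages_per_run / MAPPERS
growth_factor = MESSAGES_PER_DAY / OLD_PRICES_PER_DAY

print(f"{messages_per_second:,.0f} messages/s on average")
print(f"{messages_per_run:,.0f} messages per Camus run")
print(f"{messages_per_mapper:,.0f} messages per mapper and run")
print(f"{growth_factor:.0f}x the daily volume of the old pipeline")
```

Under these assumptions the pipeline handles roughly 116,000 messages per second, each Camus run moves about 69 million messages, and each mapper about 4.6 million.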

Present data pipeline 2017 – use cases & status quo
Uses for price information
- Monitoring price parity in hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors

Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.
- Every Euro of revenue at some point was a message in Kafka

Status quo
- Our entire BI business logic runs on and through the Kafka–Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data

Ongoing Projects: Breaking up the Monolith
[Diagram: locations Düsseldorf, Leipzig, Palma]

Key challenges and learnings
● Settle on a common message format (Avro/Protobuf, not CSV or JSON)
● A common message envelope is helpful (e.g. header with timestamp and sender)
● For stream processing, repeat your key in your message value
● Monitor your consumer offsets with an audit log, especially across data centres
● Turn off auto creation of topics, but have a process in place for topic creation
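
Two of the learnings above, the common envelope and repeating the key in the value, can be sketched as follows. This is an illustration only: the `Envelope` type, its field names and the helper functions are hypothetical, no Kafka client is involved, and in practice the envelope would be defined in the chosen format (Avro or Protobuf) rather than as a Python class. One plausible reason for repeating the key is that a stream processing step may re-key a message, after which the original key is only recoverable from the value.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Envelope:
    timestamp_ms: int        # header: when the message was produced
    sender: str              # header: which service or data centre produced it
    key: str                 # the message key, repeated inside the value
    payload: Dict[str, Any]  # the actual record


def build_message(key: str, payload: Dict[str, Any], sender: str) -> Tuple[str, Envelope]:
    """Return the (key, value) pair that would be handed to a producer."""
    value = Envelope(
        timestamp_ms=int(time.time() * 1000),
        sender=sender,
        key=key,
        payload=payload,
    )
    return key, value


def rekey_by_website(value: Envelope) -> Tuple[str, Envelope]:
    """A downstream step that re-keys the message; the original key survives in the value."""
    return value.payload["website"], value


key, value = build_message(
    "hotel-42", {"website": "example-booking-site", "price": 89.0}, sender="dus"
)
new_key, same_value = rekey_by_website(value)
```

After the re-keying step the message is keyed by website, while `same_value.key` still holds the original hotel key, and the header fields allow an audit process to check when and where each message originated.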

Thank you!
Questions and comments?

● Thanks to Jan Filipiak for his brainpower behind most projects
● Additional resources:
    ● https://github.com/trivago/gollum (an n:m message multiplexer written in Go)
    ● https://github.com/trivago/triava (TriavaCache, a JSR107 compliant cache)
