As Twitch grew, both the volume of data we received and the number of employees interested in that data grew rapidly. To continue empowering decision making as we scaled, we turned to Druid and Imply to provide self-service analytics to both technical and non-technical staff, allowing them to drill into high-level metrics instead of reading generated reports.
In this talk, learn how Twitch implemented a common analytics platform for the needs of many different teams, supporting hundreds of users, thousands of queries, and ~5 billion events each day. This session explains our Druid architecture in detail, including:
- The end-to-end architecture deployed on Amazon, including Kinesis, RDS, S3, Druid, Pivot, and Tableau
- How the data is brought together to deliver a unified view of live customer engagement and historical trends
- Operational best practices we learned while scaling Druid
- An example walkthrough using the platform
2. Twitch
Twitch is where millions of people come together live every day to chat, interact, and make their own entertainment together.
4. Data Infrastructure
Develops and operates Twitch's data platform that powers data systems and decision-making.
Our data pipeline receives over 80 billion events a day.
We provide tools to ingest, store, transform, move, and understand data.
7. Empower decision making
Data is a critical part of work and decision making at Twitch.
Our goal as a team is to empower people to find, access, and use data in their decision making.
Data staff | Non-data staff
8. Non Data Staff flow
01 Identify required data
02 Request data from data staff
03 Analyze data to make decision
9. Data Staff flow
01 Identify required data
02 Use BI tool to write query to retrieve and present data
03 Analyze data to make decision
10. Alice & Bob
Alice: All subscriptions in the US over the past year
Bob: Here's the data
Alice: What about over the past 5 years?
Bob: Here's the data
Alice: How about in the UK, South Korea, Brazil?
Bob: Here's the data
11. Challenges
Non Data Staff
● Takes a long time to get results
● Dependent on data staff
● Repeated cycle to drill in
● Different view presentation from different data staff
Data Staff
● Additional work fulfilling requests from non-data staff
● Discovery and understanding of data
● Understanding and translating data requests
● Different results from different data staff (quality, inconsistency)
13. Requirements
● Unified, consistent user interface
● Self-serve
● Reproducible, shareable results
● Fast query speed
● Trusted aggregated data with owners
14. Scale
● Daily events processed via Kinesis: 8.5 billion (1.3 TB)
● Daily events ingested via Hadoop: 5.6 GB
● Data sources: 50
● Cluster storage used: 80 TB
● Queries per day: 70k
● Users: 450 MAUs
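A quick sanity check on these figures (a rough sketch; assumes decimal units and that the 1.3 TB pairs with the Kinesis event count, as laid out on the slide):

```python
# Back-of-the-envelope figures derived from the Scale slide.
daily_events = 8.5e9        # events/day via Kinesis
daily_bytes = 1.3e12        # 1.3 TB/day (decimal TB assumed)
queries_per_day = 70_000
monthly_users = 450         # MAUs

avg_event_size = daily_bytes / daily_events          # average bytes per event
events_per_second = daily_events / 86_400            # average ingest rate
queries_per_user = queries_per_day / monthly_users   # daily queries per MAU

print(round(avg_event_size), round(events_per_second), round(queries_per_user))
```

So each event averages roughly 150 bytes, and the cluster sustains on the order of 100k events/s on average (peaks would be higher).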
17. Real time Ingest
01 Spec and create data cube
● Determine measures and dimensions of cube
● Backfill test data
● Validate cube is correct
02 Set up streaming
● Create Kinesis stream
● Start publishing events to stream
03 Backfill data, verify, publish
● Backfill data for the cube as far back as desired
● Verify output is correct
● Make cube accessible to others
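The streaming step above maps to a Druid Kinesis supervisor spec. A minimal sketch, built as a Python dict; the stream name, datasource, dimensions, and metrics are hypothetical placeholders, not Twitch's actual schema:

```python
import json

# Hypothetical Kinesis supervisor spec for a "subscriptions" cube.
# Field layout follows Druid's Kinesis indexing service spec;
# all names and column lists here are illustrative.
supervisor_spec = {
    "type": "kinesis",
    "spec": {
        "dataSchema": {
            "dataSource": "subscriptions",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "channel", "tier"]},
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",
            },
        },
        "ioConfig": {
            "stream": "subscriptions-events",
            "endpoint": "kinesis.us-west-2.amazonaws.com",
            "taskCount": 2,
            "replicas": 1,
        },
        "tuningConfig": {"type": "kinesis"},
    },
}

# Serialized, this would be POSTed to the Overlord's supervisor endpoint.
payload = json.dumps(supervisor_spec, indent=2)
```

Once the supervisor is running, events published to the Kinesis stream show up in the cube within seconds, which is what makes the "verify, publish" step fast to iterate on.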
18. Batch Ingest
01 Spec and create data cube
● Determine measures and dimensions of cube
● Backfill test data
● Validate cube is correct
02 Backfill data and verify
● Backfill data for the cube as far back as desired
● Verify output is correct
03 Set up recurring backfill and publish
● Set up daily update of the cube
● Make cube accessible to others
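The daily update in step 03 can be sketched as a Druid native parallel batch task reading one day of data from S3; a recurring job would templatize the date. The bucket, prefix, and column names below are hypothetical:

```python
import json

DAY = "2019-06-01"  # partition being (re)ingested; illustrative only

# Hypothetical index_parallel task spec: load one day of subscription
# events from S3 into the "subscriptions" datasource, replacing that
# day's segments.
batch_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "subscriptions",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "channel", "tier"]},
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "segmentGranularity": "DAY",
                "intervals": [f"{DAY}/P1D"],
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "s3",
                "prefixes": [f"s3://example-bucket/subscriptions/{DAY}/"],
            },
            "inputFormat": {"type": "json"},
        },
    },
}

payload = json.dumps(batch_spec)
```

Because batch reingestion replaces whole day segments, rerunning the job for a past date is also how historical corrections get published.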
20. Alice & Bob
Alice (in Pivot): Filter subscriptions by country US over the past year
Pivot: Here's the data
Alice: Filter over the past 5 years
Pivot: Here's the data
Alice: Filter by country US, UK, Brazil, South Korea
Pivot: Here's the data
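Each of these drill-downs is just a different filter on the same cube, so Alice never has to go back to Bob. Under the hood, Pivot issues equivalent Druid queries; a rough sketch of the corresponding Druid SQL, built in Python (the table and column names are hypothetical):

```python
def subscriptions_query(countries, years):
    """Build a Druid SQL query counting subscriptions per country.

    countries and years vary per drill-down; the "subscriptions"
    table and its columns are illustrative placeholders.
    """
    country_list = ", ".join(f"'{c}'" for c in countries)
    return (
        "SELECT country, COUNT(*) AS subscriptions\n"
        "FROM subscriptions\n"
        f"WHERE country IN ({country_list})\n"
        f"  AND __time >= CURRENT_TIMESTAMP - INTERVAL '{years}' YEAR\n"
        "GROUP BY country"
    )

# Alice's first question, then her widened drill-down:
q1 = subscriptions_query(["US"], 1)
q2 = subscriptions_query(["US", "UK", "BR", "KR"], 5)
```

The point of the self-serve model is exactly this: changing the country list or time range is a click in Pivot, not a new request to the data team.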
21. Pivot Benefits
● Consistent user interface
● Fast query times
● Shareable links
● Dashboards
● Data exploration
○ Filters
○ Multiple visualizations
○ Highlight and zoom in
○ etc.
23. Time for questions
We are hiring
https://www.linkedin.com/in/ny2ko/
Thank you!
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.