As Twitch grew, both the volume of data we received and the number of employees interested in that data grew rapidly. To continue empowering decision making as we scaled, we turned to Druid and Imply to provide self-service analytics to both technical and non-technical staff, allowing them to drill into high-level metrics instead of reading generated reports.
In this talk, learn how Twitch implemented a common analytics platform for the needs of many different teams, supporting hundreds of users, thousands of queries, and ~5 billion events each day. This session explains our Druid architecture in detail, including:
- The end-to-end architecture deployed on Amazon, including Kinesis, RDS, S3, Druid, Pivot, and Tableau
- How the data is brought together to deliver a unified view of live customer engagement and historical trends
- Operational best practices we learned while scaling Druid
- An example walkthrough using the platform
2. Twitch
Twitch is where millions of people come together live every day to chat, interact, and make their own entertainment together.
4. Data Infrastructure
Develops and operates Twitch's data platform that powers data systems and decision-making.
Our data pipeline receives over 80 billion events a day.
We provide tools to ingest, store, transform, move, and understand data.
7. Empower decision making
Data is a critical part of work and decision making at Twitch.
Our goal as a team is to empower people to find, access, and use data in their decision making.
Data staff | Non-data staff
8. Non Data Staff flow
01 Identify required data
02 Request data from data staff
03 Analyze data to make decision
9. Data Staff flow
01 Identify required data
02 Use BI tool to write query to retrieve and present data
03 Analyze data to make decision
10. Alice & Bob
Alice: All subscriptions in the US over the past year
Bob: Here's the data
Alice: What about over the past 5 years?
Bob: Here's the data
Alice: How about in the UK, South Korea, Brazil?
Bob: Here's the data
11. Challenges
Non Data Staff
● Takes a long time to get results
● Dependent on data staff
● Repeated cycle to drill in
● Different view presentation from different data staff
Data Staff
● Additional work fulfilling requests from non-data staff
● Discovery and understanding of data
● Understanding and translating data requests
● Different results from different data staff (quality, inconsistency)
13. Requirements
● Unified, consistent user interface
● Self-serve
● Reproducible, shareable results
● Fast query speed
● Trusted aggregated data with owners
14. Scale
● Daily events processed via Kinesis: 8.5 billion (1.3 TB)
● Daily events ingested via Hadoop: 5.6 GB
● Data sources: 50
● Cluster storage used: 80 TB
● Queries per day: 70k
● Users: 450 MAUs
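A quick sanity check on these figures (a rough sketch; assumes decimal units and that the 1.3 TB pairs with the Kinesis event count, as laid out on the slide):

```python
# Back-of-the-envelope figures derived from the Scale slide.
daily_events = 8.5e9        # events/day via Kinesis
daily_bytes = 1.3e12        # 1.3 TB/day (decimal TB assumed)
queries_per_day = 70_000
monthly_users = 450         # MAUs

avg_event_size = daily_bytes / daily_events          # average bytes per event
events_per_second = daily_events / 86_400            # average ingest rate
queries_per_user = queries_per_day / monthly_users   # daily queries per MAU

print(round(avg_event_size), round(events_per_second), round(queries_per_user))
```

So each event averages roughly 150 bytes, and the cluster sustains on the order of 100k events/s on average (peaks would be higher).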
17. Real time Ingest
01 Spec and create data cube
● Determine measures and dimensions of cube
● Backfill test data
● Validate cube is correct
02 Set up streaming
● Create Kinesis stream
● Start publishing events to stream
03 Backfill data, verify, publish
● Backfill data for the cube as far back as desired
● Verify output is correct
● Make cube accessible to others
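The streaming step above maps to a Druid Kinesis supervisor spec. A minimal sketch, built as a Python dict; the stream name, datasource, dimensions, and metrics are hypothetical placeholders, not Twitch's actual schema:

```python
import json

# Hypothetical Kinesis supervisor spec for a "subscriptions" cube.
# Field layout follows Druid's Kinesis indexing service spec;
# all names and column lists here are illustrative.
supervisor_spec = {
    "type": "kinesis",
    "spec": {
        "dataSchema": {
            "dataSource": "subscriptions",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "channel", "tier"]},
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",
            },
        },
        "ioConfig": {
            "stream": "subscriptions-events",
            "endpoint": "kinesis.us-west-2.amazonaws.com",
            "taskCount": 2,
            "replicas": 1,
        },
        "tuningConfig": {"type": "kinesis"},
    },
}

# Serialized, this would be POSTed to the Overlord's supervisor endpoint.
payload = json.dumps(supervisor_spec, indent=2)
```

Once the supervisor is running, events published to the Kinesis stream show up in the cube within seconds, which is what makes the "verify, publish" step fast to iterate on.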
18. Batch Ingest
01 Spec and create data cube
● Determine measures and dimensions of cube
● Backfill test data
● Validate cube is correct
02 Backfill data and verify
● Backfill data for the cube as far back as desired
● Verify output is correct
03 Set up recurring backfill and publish
● Set up daily update of the cube
● Make cube accessible to others
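The daily update in step 03 can be sketched as a Druid native parallel batch task reading one day of data from S3; a recurring job would templatize the date. The bucket, prefix, and column names below are hypothetical:

```python
import json

DAY = "2019-06-01"  # partition being (re)ingested; illustrative only

# Hypothetical index_parallel task spec: load one day of subscription
# events from S3 into the "subscriptions" datasource, replacing that
# day's segments.
batch_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "subscriptions",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "channel", "tier"]},
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "segmentGranularity": "DAY",
                "intervals": [f"{DAY}/P1D"],
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "s3",
                "prefixes": [f"s3://example-bucket/subscriptions/{DAY}/"],
            },
            "inputFormat": {"type": "json"},
        },
    },
}

payload = json.dumps(batch_spec)
```

Because batch reingestion replaces whole day segments, rerunning the job for a past date is also how historical corrections get published.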
20. Alice & Bob
Alice (in Pivot): Filter subscriptions by country US over the past year
Pivot: Here's the data
Alice: Filter over the past 5 years
Pivot: Here's the data
Alice: Filter by country US, UK, Brazil, South Korea
Pivot: Here's the data
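Each of these drill-downs is just a different filter on the same cube, so Alice never has to go back to Bob. Under the hood, Pivot issues equivalent Druid queries; a rough sketch of the corresponding Druid SQL, built in Python (the table and column names are hypothetical):

```python
def subscriptions_query(countries, years):
    """Build a Druid SQL query counting subscriptions per country.

    countries and years vary per drill-down; the "subscriptions"
    table and its columns are illustrative placeholders.
    """
    country_list = ", ".join(f"'{c}'" for c in countries)
    return (
        "SELECT country, COUNT(*) AS subscriptions\n"
        "FROM subscriptions\n"
        f"WHERE country IN ({country_list})\n"
        f"  AND __time >= CURRENT_TIMESTAMP - INTERVAL '{years}' YEAR\n"
        "GROUP BY country"
    )

# Alice's first question, then her widened drill-down:
q1 = subscriptions_query(["US"], 1)
q2 = subscriptions_query(["US", "UK", "BR", "KR"], 5)
```

The point of the self-serve model is exactly this: changing the country list or time range is a click in Pivot, not a new request to the data team.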
21. Pivot Benefits
● Consistent user interface
● Fast query times
● Shareable links
● Dashboards
● Data exploration
○ Filters
○ Multiple visualizations
○ Highlight and zoom in
○ etc.
23. Time for questions
We are hiring
https://www.linkedin.com/in/ny2ko/
Thank you!
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.