TiVo: How to Scale New Products with a Data Lake on AWS and Qubole

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
June 7, 2018 | 10:00 AM PT

Today’s presenters
Paul Sears, Partner Solutions Architect, Amazon Web Services
Harsh Jetly, Solutions Architect, Qubole
Ashish Mrig, Senior Manager, Big Data Analytics, TiVo

Today’s agenda
1. An overview of Amazon Web Services (AWS) with an
emphasis on AWS data lake solutions and Qubole
2. Overview of the Qubole solutions featured in our story
3. Challenges faced by TiVo
4. The TiVo success story with AWS and Qubole
5. Q&A/Discussion

Learning objectives:
1. How to dramatically reduce management complexities for big data
analytics operations on AWS
2. Best practices for optimizing data lakes for self-service analytics that
enable teams to productionize data science and accelerate data
pipelines
3. Using Presto with Qubole’s auto-scaling management and Spot
Instance Bidding to reduce the complexity, cost, and deployment time
of big data projects

The data lake and AWS
Drive business value with any type of data

Legacy data warehouses and RDBMS
• Complex to set up and manage
• Do not scale
• Take months to add new data sources
• Queries take too long
• Cost $MM upfront

Should I build a data lake?
Starting by amassing "all your data" and dumping
into a large repository for the data gurus to start
finding "insights" is like trying to win the lottery by
buying all the tickets.

Rethink how to become a data-driven business
1. Business outcomes - start with the insights and
actions you want to drive, then work backwards to a
streamlined design
2. Experimentation - start small, test many ideas, keep
the good ones and scale those up, paying only for what
you consume
3. Agile and timely - deploy data processing
infrastructure in minutes, not months, and take
advantage of a rich platform of services to respond
quickly to changing business needs

Business outcomes on a modern data architecture
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure

Business case determines platform design
[Diagram: Data, Ingest/Collect, Store, Process/Analyze, Consume/Visualize, Answers & Insights]
Start here, with a business case

Experiment and scale based on your business needs
Match available data: Metrics and Monitoring, Workflow, Logs, ERP, Transactions
[Diagram: Data, Ingest/Collect, Store, Process/Analyze, Consume/Visualize, Answers & Insights]

Why Amazon S3 for modern data architecture?
• Durable: designed for 11 9s of durability
• Available: designed for 99.99% availability
• High performance: multiple upload, Range GET
• Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments
• Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB, Amazon Athena
• Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies

Decouple storage and compute
• Legacy design was large databases or
data warehouses with integrated
hardware
• Big data architectures often benefit
from decoupling storage and compute

Data lake on AWS
[Diagram: a data lake on S3, shown with AWS Snowball, AWS Snowmobile, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams, Amazon Redshift, Amazon EMR, Amazon Athena, Amazon Kinesis, Amazon Elasticsearch Service, and AI Services]
• Relational and non-relational data
• Schema defined during analysis
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Run any analytics on the same data without movement
• Scale storage and compute independently
• Store data at $0.023/GB-month; query for $0.05/GB scanned

Big data activation for data-driven companies
Harsh Jetly, Solutions Architect
Looking at big data operations workflow

Copyright 2018 © Qubole
Data teams are getting overrun: increasing workloads, costs and risks
[Diagram: Petabytes of Data, Big Data Infrastructure]
Data teams under pressure
• Not enough expertise to go around: 190K unfilled jobs in US alone
• Manual provisioning makes it impossible to scale
• Exploding data, changing workloads, new data types overwhelm data team
• Missed SLAs: data delayed is data denied
• More users want on-demand access to data

Consequence: The Activation Gap
You can’t afford to activate everyone with current economics
[Chart: growth over time. Labels: volume and variety of data; use cases and tools; users and their expectations; data security; supply of big data skills; IT budget. Highlighted region: THE ACTIVATION GAP]

Big data can be successful with a modern data lake architecture that:
• scales to allow your data teams and use cases to grow with the company
• provides your teams the ability to collaborate and onboard new projects quickly
• enables your teams to iterate and prototype quickly

The transformational promise of big data
Workloads are moving to the cloud
• 58% of big data projects were on the cloud in 2017*
• 73% are running big data projects this year*
*According to a Dimensional Research study

Typical data lake operation
Data lake storage
• Zones: Raw (Staged), Semi-Structured, Derived, Analytics ("Source of Truth")
• Formats: AVRO, PARQUET
• Processing between zones: Hive / Spark
Cloud compute
• SPARK
• PRESTO: interactive ad-hoc queries
• Machine learning (batch + continuous)
Outputs
• Insert/Update/Delete
• Export CSV, JSON
• Analytic data warehouse (e.g., Redshift and Snowflake environments)
• Data serving DBs (e.g., Cassandra, DynamoDB)
Use cases
• Analytics (e.g., product analytics, BI, user insights)
• Data products (e.g., personalization, recommendation)
• Data science (e.g., time-series analysis, research)
• Data discovery (e.g., exploration, lineage, defined tables)

What is the status of your big data initiative?
• Deployed but need to reduce cost/complexity of infrastructure
• Expanding deployments, adding more data, users or workloads
• Initial use case deployed but need help to expand
• Have not deployed big data but researching how to do it
• No intention to deploy big data in the next 12 months

The imperative: Shift to a big data activation strategy
NOW: ACTIVATION GAP → NEXT: FULLY ACTIVATED DATA
• Data silos → Shared, governed data access
• 10% active / 90% inert data → 90% active / 10% inert data
• 1:10 ops/users, throw bodies at problem → 1:200 ops/users: run on automation + ML
• Serviced access to data, tools → Self-service, collaborative access to data, tools
• Focus on infrastructure → Focus on business impact
• Upside-down speed and economics → Operate with machine-speed economics

Big data activation stack
• Users: Data Scientists, Data Engineers, Analysts (each with third-party tools)
• Qubole Big Data Cloud Activation Platform: Autoscaling, Caching, Spot buying, Alerts & Insights, Serverless, …
• Cloud Data Lake

A deeper look at autoscaling

Spot instance adoption
In 2017, 54% of all Amazon EC2 compute hours used were Spot Instances,
resulting in an estimated $230 million in savings of Amazon EC2 costs.*
*About the report: Qubole Big Data Activation Report 2018

• Cluster Lifecycle Management: $150M. Amount saved by automatically
terminating a cluster when inactive
• Workload-aware Autoscaling: $121M. Amount saved by predictively adjusting
the number of nodes to meet demand
• Spot Shopper: $40M. Amount saved by utilizing Amazon EC2
Spot Instances reliably
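
The cluster lifecycle idea (terminate a cluster once it has been inactive) can be sketched in a few lines. This is an illustration only, not Qubole's implementation: the function name, the 30-minute default timeout, and the `running_commands` parameter are assumptions made for the example.

```python
from datetime import datetime, timedelta


def should_terminate(last_activity, now,
                     idle_timeout=timedelta(minutes=30),
                     running_commands=0):
    """Return True when a cluster has been idle long enough to shut down.

    All parameters are hypothetical: `last_activity` is the time the last
    command finished, `running_commands` is the number still in flight.
    """
    # Never terminate while work is still running on the cluster.
    if running_commands > 0:
        return False
    # Terminate once the idle period reaches the configured timeout.
    return now - last_activity >= idle_timeout


if __name__ == "__main__":
    finished = datetime(2018, 6, 7, 10, 0)
    print(should_terminate(finished, finished + timedelta(minutes=45)))
```

The saving comes from not paying for nodes between workloads; the timeout trades that saving against the start-up delay for the next user.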

How do you deploy big data today?
• On-premises, managing big data software and hardware
• Co-location. 3rd party manages on-premises big data software and hardware
• In the cloud, managing big data software and cloud infrastructure (EC2, etc.)
• Cloud provider SaaS. Multi-tenancy big data service managed by cloud provider
• 3rd party vendor SaaS. Multi-tenancy big data service managed by 3rd party company
• None of the above

162% growth in open source engine usage globally
(total engine usage globally by compute hours)
• 298% growth in Apache Spark
• 420% growth in Presto
• 102% growth in Apache Hadoop/Hive

• Movement to multi-engine: companies are increasingly deploying multiple OSS
engines for different use cases (ML, ETL, analytics, etc.)
• Users getting more access: more users have access to data and are running
more commands and collaborating
• Cloud benefits recognized: companies are leveraging cloud for rapid
innovation and automation to scale

How is Presto used?
Targeted Audience Delivery

Targeted Audience Delivery
brought to you (in part) by

Why Presto?
• Storage/compute separation
• Easy to add and remove worker nodes
• Query many different data sources (inside our VPC) without separate load
• Good performance for analytical queries.
Not so good for transactional and simple queries…
• Managed (e.g., Qubole, Starburst)

How Presto works
Data is streamed
back to the workers

Lesson learned:
What instance types should we use?

Memory Pools:
• System memory pool (40% of Java heap space)
• Reserved memory pool (largest query’s memory usage)
• General memory pool (the rest of the memory)
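
As a rough illustration of how the three pools divide a worker's heap, the sketch below applies the split described in the list above. The function and parameter names are hypothetical, and the 40% figure is taken from the slide rather than from any particular Presto release.

```python
def presto_memory_pools(heap_gb, largest_query_gb, system_fraction=0.40):
    """Split a worker's Java heap into system, reserved, and general pools.

    Illustrative only: `heap_gb` is the JVM heap size and `largest_query_gb`
    is the memory the largest query may use on this node.
    """
    if heap_gb <= 0:
        raise ValueError("heap_gb must be positive")
    system = heap_gb * system_fraction      # system pool: 40% of the heap
    reserved = largest_query_gb             # reserved pool: largest query
    general = heap_gb - system - reserved   # general pool: whatever is left
    if general < 0:
        raise ValueError("largest query does not fit in the remaining heap")
    return {"system": system, "reserved": reserved, "general": general}


if __name__ == "__main__":
    print(presto_memory_pools(heap_gb=100, largest_query_gb=20))
```

With a 100 GB heap and a 20 GB largest query, 40 GB remains for the general pool. Raising the largest-query allowance shrinks the general pool one for one, which is why widely varying query sizes make a single cluster hard to tune.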

Working with reserved memory pool
Conceptually, the reserved memory pool should be the “high water mark”
while most queries complete in the general pool.
How do we achieve that?
• What if memory usage varies a lot between different queries?
Solution: multiple clusters based on workload
• Use many inexpensive instances, or a few expensive instances?
Empirical testing found the large instance type was slightly faster
• Compute optimized or memory optimized?
Solution: cost/benefit analysis

Choosing the Right Instance Type
r4.4xlarge
• r: instance class
• 4: generation
• 4xlarge: multiplier for CPU and memory
Other examples: t2.2xlarge, c5.16xlarge
Over 100 to choose from!
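
The naming scheme above can be made concrete with a small parser. This is a sketch under stated assumptions: the function name and the returned dictionary keys are hypothetical, and the multiplier is only derived for sizes ending in "xlarge" (other sizes such as "large" return None).

```python
import re

# class letters, generation digits, optional attribute letters, then the size
_INSTANCE_RE = re.compile(
    r"^(?P<cls>[a-z]+)(?P<gen>\d+)(?P<extra>[a-z]*)\.(?P<size>[a-z0-9]+)$"
)


def parse_instance_type(name):
    """Split an EC2 instance type name into class, generation, and size."""
    match = _INSTANCE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    size = match.group("size")
    multiplier = None
    if size.endswith("xlarge"):
        prefix = size[: -len("xlarge")]
        if prefix == "":
            multiplier = 1            # plain "xlarge"
        elif prefix.isdigit():
            multiplier = int(prefix)  # e.g. "4xlarge" gives 4
    return {
        "class": match.group("cls"),
        "generation": int(match.group("gen")),
        "size": size,
        "multiplier": multiplier,
    }


if __name__ == "__main__":
    for name in ("r4.4xlarge", "t2.2xlarge", "c5.16xlarge"):
        print(name, parse_instance_type(name))
```

Splitting names this way makes it easier to compare candidates along one axis at a time, for example the same class and size across generations.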

Credit: Willard Simmons (DataXu)

Newer instances are more efficient
[Chart annotations: "Better for larger memory clusters"; "Better for smaller memory clusters"]

Lesson learned:
Elastic Scaling

Average Presto query
[Diagram: 1 query, one Presto coordinator, two Presto workers]
When will queries complete at current rate?

Concurrent Presto queries
[Diagram: 10 queries, one Presto coordinator, two Presto workers]
When will queries complete at current rate? Not fast enough!

More concurrency? Scale up
[Diagram: 10 queries, one Presto coordinator, four Presto workers]
When will queries complete at current rate?
Qubole provisions more nodes up to a limit (around 3 minutes)

Back to single Presto query
[Diagram: 1 query, one Presto coordinator, four Presto workers]
When will queries complete at current rate? Too fast!

Scale down
[Diagram: 1 query, one Presto coordinator, two Presto workers]
When will queries complete at current rate?
Qubole decommissions nodes, down to a limit

My big fat Presto query
[Diagram: 1 query, one Presto coordinator, two Presto workers, each at 100% CPU]
When will queries complete at current rate? Not fast enough!

Autoscaling is for concurrency
[Diagram: 1 query, one Presto coordinator, four Presto workers: two at 100% CPU, two idle]
When will queries complete at current rate? Not so fast… Not fast enough!
Upscaling only works for new queries
Maybe we should have sent this query to a more powerful cluster?
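
The scale-up and scale-down slides can be summarized as a small sizing heuristic: estimate when the pending queries will finish at the current rate, then size the cluster to hit a target, within limits. This is a hypothetical sketch, not Qubole's algorithm; every name, the per-worker throughput, and the target time are assumptions for illustration.

```python
import math


def estimate_completion_seconds(pending_queries, workers,
                                queries_per_worker_per_minute):
    """Seconds until the backlog drains at the current cluster size."""
    rate_per_minute = workers * queries_per_worker_per_minute
    return 60.0 * pending_queries / rate_per_minute


def target_workers(pending_queries, queries_per_worker_per_minute,
                   target_seconds, min_workers, max_workers):
    """Workers needed to drain the backlog within target_seconds."""
    needed = math.ceil(
        60.0 * pending_queries
        / (queries_per_worker_per_minute * target_seconds)
    )
    # Clamp to the configured limits: provision up to a maximum,
    # decommission down to a minimum.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    # Ten concurrent queries on two workers: too slow, so scale up.
    print(estimate_completion_seconds(10, 2, 1.0))
    print(target_workers(10, 1.0, 150, min_workers=2, max_workers=4))
    # A single query: the heuristic falls back to the minimum size.
    print(target_workers(1, 1.0, 150, min_workers=2, max_workers=4))
```

Note what the heuristic cannot do: a single heavy query counts as one pending query, so it never triggers a scale-up, and nodes added mid-flight would only serve new queries anyway. That is the case for routing such queries to a more powerful cluster.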

Results
Elastic scaling: Spin the nodes up/down based on demand
Benefit: Cost savings
Specialized clusters: Different clusters for different workloads
Benefit: Efficiency
Storage/Compute separation: Store on Amazon S3, serve using Presto
Benefit: Scalability and data availability

Next steps and further information
• Data Lake solution on AWS:
https://aws.amazon.com/big-data/data-lake-on-aws/
• Get started with Qubole:
https://aws.amazon.com/quickstart/architecture/qubole-on-data-lake-foundation/
• Try AWS for free:
https://aws.amazon.com/

Q & A

Thank you!
