TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
June 7, 2018 | 10:00 AM PT
TiVo: How to scale new products with a data lake on AWS and Qubole
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
2. Today’s presenters
Paul Sears, Partner Solutions Architect, Amazon Web Services
Harsh Jetly, Solutions Architect, Qubole
Ashish Mrig, Senior Manager, Big Data Analytics, TiVo
3. Today’s agenda
1. An overview of Amazon Web Services (AWS) with an
emphasis on AWS data lake solutions and Qubole
2. Overview of the Qubole solutions featured in our story
3. Challenges faced by TiVo
4. The TiVo success story with AWS and Qubole
5. Q&A/Discussion
4. Learning objectives:
1. How to dramatically reduce management complexities for big data
analytics operations on AWS
2. Best practices for optimizing data lakes for self-service analytics that
enable teams to productionize data science and accelerate data
pipelines
3. Using Presto with Qubole’s auto-scaling management and Spot
Instance Bidding to reduce the complexity, cost, and deployment time
of big data projects
5. The data lake and AWS
Drive business value with any type of data
6. Legacy data warehouses and RDBMS
• Complex to set up and manage
• Do not scale
• Take months to add new data sources
• Queries take too long
• Cost $MM upfront
7. Should I build a data lake?
Starting by amassing "all your data" and dumping it
into a large repository for the data gurus to start
finding "insights" is like trying to win the lottery by
buying all the tickets.
8. Rethink how to become a data-driven business
1. Business outcomes - start with the insights and
actions you want to drive, then work backwards to a
streamlined design
2. Experimentation - start small, test many ideas, keep
the good ones and scale those up, paying only for what
you consume
3. Agile and timely - deploy data processing
infrastructure in minutes, not months and take
advantage of a rich platform of services to respond
quickly to changing business needs
9. Business outcomes on a modern data architecture
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure
10. Business case determines platform design
[Diagram: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights. Callout: "Start here with a business case."]
11. Experiment and scale based on your business needs
[Diagram: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights. Callout: "Match available data."]
Available data: Metrics and Monitoring, Workflow, Logs, ERP, Transactions
12. Why Amazon S3 for modern data architecture?
• Durable: designed for 11 9s of durability
• Available: designed for 99.99% availability
• High performance: multipart upload, Range GET
• Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments
• Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB, Amazon Athena
• Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
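As a rough illustration of the Range GET point on this slide, the sketch below splits an object into inclusive byte ranges in the standard HTTP `Range` header format, so that parts could be fetched in parallel. The function name and the part size are assumptions made for this example; no S3 client call is shown.

```python
def byte_ranges(object_size, part_size):
    """Yield HTTP Range header values that together cover the whole object."""
    for start in range(0, object_size, part_size):
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + part_size, object_size) - 1
        yield f"bytes={start}-{end}"


# A 10-byte object read in 4-byte parts.
print(list(byte_ranges(10, 4)))
```

Each value could then be sent as the `Range` header of a separate GET request for the same object.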
13. Decouple storage and compute
• Legacy design was large databases or
data warehouses with integrated
hardware
• Big data architectures often benefit
from decoupling storage and compute
14. Data lake on AWS
[Diagram: Amazon S3 shown with AWS Snowball, AWS Snowmobile, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams, Amazon Redshift, Amazon EMR, Amazon Athena, Amazon Kinesis, Amazon Elasticsearch Service, and AI Services]
• Relational and non-relational data
• Schema defined during analysis
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Run any analytics on the same data without movement
• Scale storage and compute independently
• Store data at $0.023/GB per month; query for $0.05/GB scanned
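To make the pricing bullet concrete, here is a small arithmetic sketch. It takes the slide's two figures at face value as default rates and assumes the storage rate is per GB per month. The function name and the example volumes are hypothetical, and the rates should be checked against current AWS pricing before being relied on.

```python
def monthly_cost_usd(stored_gb, scanned_gb, storage_rate=0.023, query_rate=0.05):
    """Estimate one month's cost from GB stored and GB scanned by queries.

    Default rates are the figures quoted on the slide (assumed per GB).
    """
    return stored_gb * storage_rate + scanned_gb * query_rate


# Hypothetical workload: 1,000 GB stored, 200 GB scanned in the month.
print(round(monthly_cost_usd(stored_gb=1000, scanned_gb=200), 2))
```

Because storage and query costs are separate terms, either can grow without affecting the other, which is the slide's "scale storage and compute independently" point.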
15. Big data activation for data-driven companies
Harsh Jetly, Solutions Architect
Looking at big data operations workflow
16. Copyright 2018 © Qubole
Data teams are getting overrun: increasing workloads, costs and risks
[Diagram labels: Petabytes of Data; Big Data Infrastructure; Data teams under pressure]
• Not enough expertise to go around: 190K unfilled jobs in US alone
• Manual provisioning makes it impossible to scale
• Exploding data, changing workloads, new data types overwhelm data team
• Missed SLAs: data delayed is data denied
• More users want on demand access to data
17. Consequence: The Activation Gap
You can’t afford to activate everyone with current economics
[Chart: "The Activation Gap", growth over time. Labels: Use cases and Tools; Users and their expectations; Supply of Big Data skills; IT budget; Volume and variety of data; Data security]
18. Big data can be successful with modern data lake architecture that:
• scales to allow your Data Teams and Use Cases to grow with the company
• provides your teams the ability to collaborate and onboard new projects quickly
• enables your teams to iterate and prototype quickly
19. The transformational promise of
big data workloads are moving to the cloud
58% of big data projects were on the cloud in 2017*
73% are running big data projects this year*
*According to a Dimensional Research study
20. Typical data lake operation
[Diagram labels, grouped]
• Data Lake Storage: Raw (Staged), Semi-Structured, and Derived Analytics ‘Source of Truth’ data, in AVRO and PARQUET formats, processed with Hive / Spark (Insert/Update/Delete)
• Export (CSV, JSON): Analytic Data Warehouse (e.g. Redshift and Snowflake environments); Data Serving DBs (e.g. Cassandra, DynamoDB)
• Cloud Compute: SPARK; PRESTO (interactive ad-hoc queries); Machine Learning (batch + continuous)
• Use Cases: Analytics (e.g. Product Analytics, BI, User insights); Data Products (e.g. Personalization, Recommendation); Data Science (e.g. Time-series Analysis, Research); Data Discovery (e.g. Exploration, Lineage, Defined Tables)
21. What is the status of your big data initiative?
Deployed but need to reduce cost/complexity of infrastructure
Expanding deployments, adding more data, users or workloads
Initial use case deployed but need help to expand
Have not deployed big data but researching how to do it
No intention to deploy big data in the next 12 months
22. The imperative: Shift to a big data activation strategy
NOW: ACTIVATION GAP | NEXT: FULLY ACTIVATED DATA
Data silos | Shared, governed data access
10% active / 90% inert data | 90% active / 10% inert data
1:10 ops/users, throw bodies at problem | 1:200 ops/users: run on automation + ML
Serviced access to data, tools | Self service, collaborative access to data, tools
Focus on infrastructure | Focus on business impact
Upside down speed and economics | Operate with machine-speed economics
23. Big data activation stack
[Diagram, top to bottom]
• Users: Data Scientists, Data Engineers, Analysts, each with Third-Party Tools
• Qubole Big Data Cloud Activation Platform: Autoscaling, Caching, Spot buying, Alerts & Insights, Serverless, …
• Cloud Data Lake
24. A deeper look at autoscaling
25. About the Report
In 2017, 54% of all Amazon EC2 compute hours used were spot instances,
resulting in an estimated $230 million in savings of Amazon EC2 costs.*
Spot instance adoption
*Qubole Big Data Activation Report 2018
26. [Chart values]
• Cluster Life Cycle Management: $150M
• Workload-aware Autoscaling: $121M
• Spot Shopper: $40M
Cluster Lifecycle Savings – Amount saved by automatically terminating a cluster when inactive
Workload-aware Autoscaling Savings – Amount saved by predictively adjusting the number of nodes to meet demand
Spot Shopper Savings – Amount saved by utilizing Amazon EC2 Spot Instances reliably
27. How do you deploy big data today?
On-premises managing big data software and hardware
Co-location. 3rd party manages on-premises big data
In the cloud. You manage big data and cloud infrastructure
Cloud SaaS. Multi-tenancy big data service from cloud provider
SaaS vendor. Multi-tenancy big data service from 3rd party
28. How do you deploy big data today?
On-premises managing big data software and hardware
Co-location. 3rd party manages on-premises big data software and hardware
In the cloud managing big data software and cloud infrastructure (EC2, etc.)
Cloud provider SaaS. Multi-tenancy big data service managed by Cloud Provider
3rd party vendor SaaS. Multi-tenancy big data service managed by 3rd party company
None of the above
29. 162% Growth in Open Source Engine Usage Globally
298% growth in Apache Spark
420% growth in Presto
102% growth in Apache Hadoop/Hive
Total Engine Usage Globally By Compute Hours
30. Movement to multi-engine
Companies are increasingly deploying multiple OSS engines for different use cases (ML, ETL, analytics, etc.)
Users getting more access
More users have access to data and are running more commands and collaborating
Cloud benefits recognized
Companies are leveraging cloud for rapid innovation and automation to scale
34. Why Presto?
• Storage/Compute Separation
• Easy to add and remove worker nodes
• Query many different data sources (inside our VPC)
without separate load
• Good performance for analytical queries.
Not so good for transactional and simple queries…
• Managed (e.g., Qubole, Starburst)
37. Memory Pools:
• System memory pool (40% of Java heap space)
• Reserved memory pool (largest query’s memory usage)
• General memory pool (the rest of the memory)
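The split described on this slide can be worked through as simple arithmetic. The sketch below is only an illustration: the function and parameter names are made up for this example, the 40% system share is the figure from the slide, and the reserved pool is sized to the largest query's memory as the slide describes. It does not read or set any real Presto configuration.

```python
def presto_memory_pools(heap_gb, largest_query_gb, system_fraction=0.40):
    """Split a worker's Java heap into the three pools named on the slide."""
    system = heap_gb * system_fraction        # system pool: 40% of heap
    reserved = largest_query_gb               # reserved pool: largest query's usage
    general = heap_gb - system - reserved     # general pool: whatever is left
    if general < 0:
        raise ValueError("system and reserved pools exceed the heap")
    return {"system": system, "reserved": reserved, "general": general}


# Hypothetical worker: 100 GB heap, largest query allowed 20 GB on this node.
print(presto_memory_pools(heap_gb=100, largest_query_gb=20))
```

The arithmetic shows the trade-off: every GB granted to the largest query is a GB taken from the general pool that all other queries share.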
38. Working with reserved memory pool
Conceptually, reserved memory pool should be the “high water mark” while most queries complete in the general pool.
How do we achieve that?
• What if memory usage varies a lot between different queries?
Solution: multiple clusters based on workload
• Use many inexpensive instances, or a few expensive instances?
Empirical testing found large instance type was slightly faster
• Compute optimized or memory optimized?
Solution: Cost/Benefit Analysis
39. Choosing the Right Instance Type
r4.4xlarge
• r: Instance Class
• 4: Generation
• 4xlarge: Multiplier for CPU and Mem
Other examples: t2.2xlarge, c5.16xlarge
Over 100 to choose from!
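The naming scheme on this slide can be expressed as a short parser. This is a sketch under a simplifying assumption: it handles only the plain class-generation-size form shown on the slide (such as `r4.4xlarge`), not names with extra letters after the generation. The `InstanceType` tuple and `parse_instance_type` function are names invented for this example.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    instance_class: str  # e.g. "r" (memory optimized), "c" (compute optimized)
    generation: int      # e.g. 4
    size: str            # CPU/memory multiplier, e.g. "4xlarge"


# Letters, then digits, then a dot, then the size label.
_PATTERN = re.compile(r"^([a-z]+)(\d+)\.(\w+)$")


def parse_instance_type(name):
    """Split a simple EC2 instance type name into class, generation and size."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    return InstanceType(match.group(1), int(match.group(2)), match.group(3))


print(parse_instance_type("r4.4xlarge"))
print(parse_instance_type("t2.2xlarge"))
```

Splitting the name this way makes it easy to group candidate instances by class or generation when comparing them for cost and benefit.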
41. Choosing the Right Instance Type
Newer instances are more efficient
Credit: Willard Simmons (DataXu)
42. Choosing the Right Instance Type
Better for larger memory clusters
Newer instances are more efficient
Credit: Willard Simmons (DataXu)
43. Choosing the Right Instance Type
Better for smaller memory clusters
Newer instances are more efficient
Credit: Willard Simmons (DataXu)
50. My big fat Presto query
[Diagram: one Presto Coordinator and two Presto Workers, both at 100% CPU, running 1 query]
When will queries complete at current rate? Not fast enough!
51. [Diagram: two more Presto Workers are added; the original two stay at 100% CPU, the new two are idle, still 1 query]
When will queries complete at current rate? Not fast enough!
Not so fast… Upscaling only works for new queries
Maybe we should have sent this query to a more powerful cluster?
Autoscaling is for concurrency
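The point of these two slides can be shown with a toy model. This is not Qubole's autoscaling algorithm; the function names, the abstract "work units", and the per-worker rate are all assumptions made for illustration. A query that is already running is served only by the workers present when it started, while added workers help queries that have not started yet.

```python
import math


def running_query_eta(remaining_work, workers_at_start, rate_per_worker):
    """Seconds until a running query finishes.

    Only the workers present when the query started contribute, so workers
    added later do not change this estimate.
    """
    return remaining_work / (workers_at_start * rate_per_worker)


def workers_for_queue(queued_work, rate_per_worker, target_seconds):
    """Workers needed to drain not-yet-started queries within the target time."""
    return math.ceil(queued_work / (rate_per_worker * target_seconds))


# One big query already running on 2 workers: upscaling leaves this unchanged.
print(running_query_eta(remaining_work=1200, workers_at_start=2, rate_per_worker=1.0))
# New queries totalling 3600 work units that should finish within 600 seconds.
print(workers_for_queue(queued_work=3600, rate_per_worker=1.0, target_seconds=600))
```

In this model, the only way to speed up the single large query is to start it on a cluster that already has more capacity, which is why the slide suggests routing it to a more powerful cluster.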
52. Results
Elastic scaling: Spin the nodes up/down based on demand
Benefit: Cost savings
Specialized clusters: Different clusters for different workloads
Benefit: Efficiency
Storage/Compute separation: Store on Amazon S3, serve using Presto
Benefit: Scalability and data availability
53. Next steps and further information
• Data Lake solution on AWS:
https://aws.amazon.com/big-data/data-lake-on-aws/
• Get started with Qubole:
https://aws.amazon.com/quickstart/architecture/qubole-on-data-lake-foundation/
• Try AWS for free:
https://aws.amazon.com/
54. Q & A
55. Thank you!