SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Big Data on Cloud Native
Platform
Rajesh Balamohan
Sunil Govindan
Speaker Bio
Rajesh Balamohan
Principal Engineer 2 @ Cloudera
Apache Hive, ORC Committer & Apache Tez PMC and Committer
@rajeshbalamohan
Sunil Govindan
Engineering Manager @ Cloudera
Apache Hadoop, Submarine, YuniKorn PMC member & Committer
@sunilgovind
Agenda
● Why Big Data workloads need to migrate to Cloud
● Aspects of Enterprise Ready Cloud Platform
● Challenges of Big Data on Cloud Platform
Why Big Data workloads need to migrate to cloud ?
About (Big) Data itself...
Key thought process from the customers about today’s DATA are,
“Ability to consistently extract accurate business proposition from data”
“Data will grow over time - probably, exponentially”
“Data analytics returns profound business insights only when you have access to
more data”
So how do we keep data available as needed (to get value from that data) ?
Data Architecture Evolution: Gen 1
Data volumes are growing
exponentially and on-prem is
not cost effective & scalable!
Cloud Adoption Trend
“The worldwide infrastructure as a service (IaaS) market grew 37.3% in 2019 to
total $44.5 billion, up from $32.4 billion in 2018, according to Gartner, Inc.”
Cloud Adoption is growing at a rapid pace, why ?
“Cloud computing offers access to data storage and compute on a more
scalable, flexible and cost-effective than can be achieved with an on-
premises deployment”
Why Big Data workloads need Cloud?
Some high level advantages:
● Pay as you go : No hardware acquisitions, thus Zero CAPEX
● Self Serve : Easier Accessibility
● Cost Effective & On-Demand
● Highly Elastic : Can scale 100s of nodes up/down easily
● No more installation/upgrade hassles
● Disaggregated Storage
Data Architecture Evolution: Gen 2
Hadoop in the Public Cloud!
Big Data in Cloud
Hadoop: “Decade Two, Day Zero”
Philosophy towards a modern Data Architecture
● Disaggregate storage, compute, security and governance
● Build for extremely large-scale using distributed systems
● Leverage open source for open standards and community scale
● Continuously evolve the ecosystem for innovation at every layer,
independently
Data Architecture Evolution: Gen 3
Aspects of Enterprise Ready Cloud Platform
Critical Aspects of Enterprise Cloud Platform
● Manage and monitor multiple
clusters
● Secure data via single window
● Authentication & Authorization via
single window
● Replicate data across multiple
clusters on need basis
● Profile and debug queries across
multiple clusters via single window
● Multiple experiences depending on
the user (Data Engineering,
Streaming, Fast Analytics, Data
profiling etc)
Classic Clusters
(Optional)
Manage multiple clusters in central place
Ability to have control over the data end to end
Provide access & control of data to end-users right from ingestion phase to
prediction phase.
Big Data Challenges on Cloud Platform
Challenges in the dimension of
- Storage
- Network
- Compute
- Throttling
- Security
- Hardware Specs
* These are some of dimensions that we would like to cover in today’s talk.
Consistency & Latency Issues with ObjectStores
● Eventual Consistency Issues
○ Certain ObjectStores provide eventual consistency (e.g S3)
■ New files may not be visible for listing (until safely propagated internally).
■ Opening deleted file may be possible due to consistency issues
○ S3Guard
■ Uses “DynamoDB” to persist metadata changes. Provides consistent view of S3
objects for processing.
■ Supports DynamoDB on-demand (i.e no need to explicitly set capacity limits).
● Renames can be expensive
○ Rename = “Copy + Delete” in ObjectStores like S3.
○ Need to build stack which reduces rename operations or favours direct write to
destination
● OS Page cache is not leveraged as data is read over network
Intelligent Caching for Query Performance
● Avoid reading same data from
ObjectStores
○ Systems like Hive/LLAP and Impala
cache data locally for improving query
performance.
Reduce Network Latency
● Reduce number of SSL
connections to
ObjectStores
○ Added lazySeek
implementation to reduce
connection breakages.
AutoScaling
● Determining the right cluster size can
be challenging.
● AutoScaling helps in scaling up/down
instances depending on workload
○ Concurrency Based AutoScaling
■ Helps in controlling number of
parallel queries
○ Query Isolation
■ When queries scan beyond a certain
limit, new clusters are automatically
spun up.
Affinity Policies for better Network Throughput
- AutoScaling policies allow you spin up instances across different
availability zones
- By default cloud providers tend to spread instances across AZ for availability.
- Impacts network throughput for nodes with 10Gbps speed
- Set affinity policy to have the instances in the same availability zone
Spin up Time
● Cluster/Compute spin up time plays a crucial role in adoption and
reducing cost.
● Containerized deployments help a lot in reducing spin up time
significantly with K8S
○ 10s of seconds as opposed to minutes
K8S: Pods can have same hostname/port
● Pods can have same hostname/port after restart
● This causes trouble for processes tracking nodes based on
hostname/port
● Added flexibility in the stack to take care of this situation
○ E.g TEZ-4179: [Kubernetes] Extend NodeId in tez to support unique worker identity
Throttling
● Cloud services throttle
requests
○ Throttling limits vary across cloud
vendors
● Critical to monitor throttling
metrics
○ Desirable to enable metrics
logging in ObjectStore
○ Accuracy limited to per minute in
most of the objectstores
Throttling
System trying to resend data over SSL on receiving 503 (throttling) causing CPU spike
Security
● Perimeter Security
● Encrypted data at rest
● Transfer of intermediate data encrypted
● Need to use optimised libs for improving transport security
Hardware Specs across Cloud Vendors
● Watch out for hardware specs across cloud vendors.
○ E.g SSD in Azure can have different perf characteristics than AWS
● OS settings have to be tweaked accordingly
○ E.g network, disk settings
● Choose optimal instance for the workload
○ E.g Instances with high density disks may not be needed as data is stored in ObjectStore
○ Too little disk space can hurt intermediate data being written out.
Tomorrow ...
● Plenty of challenges to run Big Data workloads on Cloud
○ Great efforts from Open Source community!
● Users need “No vendor lock in”
○ An Open Data layer for multi-cloud (SODA, CSI etc with infinite possibilities)
○ Network standards across clouds (CNI)
○ Data Lineage and governance for user (Apache Atlas)
○ Security and access as open standard (Apache Ranger)
● Users are looking for an Open Data Architecture for multiple clouds which
is enterprise ready!
Thank You
● References
○ Cloudera Data Platform (Multi Cloud): https://docs.cloudera.com/cdp/latest/index.html
○ Hadoop: Decade two, Day zero: https://blog.cloudera.com/hadoop-decade-two-day-zero/
● Cloudera careers
Q/A

Weitere ähnliche Inhalte

Was ist angesagt?

In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! Uri Cohen
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is DistributedAlluxio, Inc.
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownDataStax
 
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...DataStax Academy
 
Big Data Quickstart Series 3: Perform Data Integration
Big Data Quickstart Series 3: Perform Data IntegrationBig Data Quickstart Series 3: Perform Data Integration
Big Data Quickstart Series 3: Perform Data IntegrationAlibaba Cloud
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBlueData, Inc.
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...DataStax
 
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Databricks
 
Logical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetupLogical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetupGianmario Spacagna
 
Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0Cloudian
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...DataStax
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...DataStax
 
Cloudian HyperStore Operating Environment
Cloudian HyperStore Operating EnvironmentCloudian HyperStore Operating Environment
Cloudian HyperStore Operating EnvironmentCloudian
 
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...Splunk
 
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...DataStax
 
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)Ontico
 
Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"
Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"
Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"Fwdays
 
Exploring Alluxio for Daily Tasks at Robinhood
Exploring Alluxio for Daily Tasks at RobinhoodExploring Alluxio for Daily Tasks at Robinhood
Exploring Alluxio for Daily Tasks at RobinhoodAlluxio, Inc.
 

Was ist angesagt? (20)

In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified!
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
 
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
 
Big Data Quickstart Series 3: Perform Data Integration
Big Data Quickstart Series 3: Perform Data IntegrationBig Data Quickstart Series 3: Perform Data Integration
Big Data Quickstart Series 3: Perform Data Integration
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 Telco
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
 
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
 
Logical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetupLogical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetup
 
Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0Introducing Cloudian HyperStore 6.0
Introducing Cloudian HyperStore 6.0
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
Cloudian HyperStore Operating Environment
Cloudian HyperStore Operating EnvironmentCloudian HyperStore Operating Environment
Cloudian HyperStore Operating Environment
 
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
 
Zabbix at scale with Elasticsearch
Zabbix at scale with ElasticsearchZabbix at scale with Elasticsearch
Zabbix at scale with Elasticsearch
 
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
 
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL)
 
Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"
Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"
Дмитрий Лавриненко "Blockchain for Identity Management, based on Fast Big Data"
 
Exploring Alluxio for Daily Tasks at Robinhood
Exploring Alluxio for Daily Tasks at RobinhoodExploring Alluxio for Daily Tasks at Robinhood
Exploring Alluxio for Daily Tasks at Robinhood
 

Ähnlich wie Big Data on Cloud Native Platform

Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalAvere Systems
 
Slides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data LakesSlides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data LakesDATAVERSITY
 
Caching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session ICaching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session IVMware Tanzu
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Alluxio, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformDATAVERSITY
 
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Denodo
 
A Successful Journey to the Cloud with Data Virtualization
A Successful Journey to the Cloud with Data VirtualizationA Successful Journey to the Cloud with Data Virtualization
A Successful Journey to the Cloud with Data VirtualizationDenodo
 
AWS Summit 2013 | Auckland - Building Web Scale Applications with AWS
AWS Summit 2013 | Auckland - Building Web Scale Applications with AWSAWS Summit 2013 | Auckland - Building Web Scale Applications with AWS
AWS Summit 2013 | Auckland - Building Web Scale Applications with AWSAmazon Web Services
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
 
How To Build A Stable And Robust Base For a “Cloud”
How To Build A Stable And Robust Base For a “Cloud”How To Build A Stable And Robust Base For a “Cloud”
How To Build A Stable And Robust Base For a “Cloud”Hardway Hou
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAlluxio, Inc.
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartchCloudera, Inc.
 
What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?All Things Open
 
Cloud Migration headache? Ease the pain with Data Virtualization! (EMEA)
Cloud Migration headache? Ease the pain with Data Virtualization! (EMEA)Cloud Migration headache? Ease the pain with Data Virtualization! (EMEA)
Cloud Migration headache? Ease the pain with Data Virtualization! (EMEA)Denodo
 
Cloud Architecture best practices
Cloud Architecture best practicesCloud Architecture best practices
Cloud Architecture best practicesOmid Vahdaty
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservicesBigstep
 

Ähnlich wie Big Data on Cloud Native Platform (20)

Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute final
 
Slides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data LakesSlides: Accelerating Queries on Cloud Data Lakes
Slides: Accelerating Queries on Cloud Data Lakes
 
Caching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session ICaching for Microservices Architectures: Session I
Caching for Microservices Architectures: Session I
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
 
A Successful Journey to the Cloud with Data Virtualization
A Successful Journey to the Cloud with Data VirtualizationA Successful Journey to the Cloud with Data Virtualization
A Successful Journey to the Cloud with Data Virtualization
 
AWS Summit 2013 | Auckland - Building Web Scale Applications with AWS
AWS Summit 2013 | Auckland - Building Web Scale Applications with AWSAWS Summit 2013 | Auckland - Building Web Scale Applications with AWS
AWS Summit 2013 | Auckland - Building Web Scale Applications with AWS
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
How To Build A Stable And Robust Base For a “Cloud”
How To Build A Stable And Robust Base For a “Cloud”How To Build A Stable And Robust Base For a “Cloud”
How To Build A Stable And Robust Base For a “Cloud”
 
oracle.pptx
oracle.pptxoracle.pptx
oracle.pptx
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
 
What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?
 
Cloud Migration headache? Ease the pain with Data Virtualization! (EMEA)
Cloud Migration headache? Ease the pain with Data Virtualization! (EMEA)Cloud Migration headache? Ease the pain with Data Virtualization! (EMEA)
Cloud Migration headache? Ease the pain with Data Virtualization! (EMEA)
 
Cloud Architecture best practices
Cloud Architecture best practicesCloud Architecture best practices
Cloud Architecture best practices
 
Vue d'ensemble Dremio
Vue d'ensemble DremioVue d'ensemble Dremio
Vue d'ensemble Dremio
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud Computing
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 

Kürzlich hochgeladen

The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 

Kürzlich hochgeladen (20)

The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 

Big Data on Cloud Native Platform

  • 1. Big Data on Cloud Native Platform Rajesh Balamohan Sunil Govindan
  • 2. Speaker Bio Rajesh Balamohan Principal Engineer 2 @ Cloudera Apache Hive, ORC Committer & Apache Tez PMC and Committer @rajeshbalamohan Sunil Govindan Engineering Manager @ Cloudera Apache Hadoop, Submarine, YuniKorn PMC member & Committer @sunilgovind
  • 3. Agenda ● Why Big Data workloads need to migrate to Cloud ● Aspects of Enterprise Ready Cloud Platform ● Challenges of Big Data on Cloud Platform
  • 4. Why Big Data workloads need to migrate to cloud ?
  • 5. About (Big) Data itself... Key thought process from the customers about today’s DATA are, “Ability to consistently extract accurate business proposition from data” “Data will grow over time - probably, exponentially” “Data analytics returns profound business insights only when you have access to more data” So how do we keep data available as needed (to get value from that data) ?
  • 6. Data Architecture Evolution: Gen 1 Data volumes are growing exponentially and on-prem is not cost effective & scalable!
  • 7. Cloud Adoption Trend “The worldwide infrastructure as a service (IaaS) market grew 37.3% in 2019 to total $44.5 billion, up from $32.4 billion in 2018, according to Gartner, Inc.” Cloud Adoption is growing at a rapid pace, why ? “Cloud computing offers access to data storage and compute on a more scalable, flexible and cost-effective than can be achieved with an on- premises deployment”
  • 8. Why Big Data workloads need Cloud? Some high level advantages: ● Pay as you go : No hardware acquisitions, thus Zero CAPEX ● Self Serve : Easier Accessibility ● Cost Effective & On-Demand ● Highly Elastic : Can scale 100s of nodes up/down easily ● No more installation/upgrade hassles ● Disaggregated Storage
  • 9. Data Architecture Evolution: Gen 2 Hadoop in the Public Cloud!
  • 10. Big Data in Cloud Hadoop: “Decade Two, Day Zero” Philosophy towards a modern Data Architecture ● Disaggregate storage, compute, security and governance ● Build for extremely large-scale using distributed systems ● Leverage open source for open standards and community scale ● Continuously evolve the ecosystem for innovation at every layer, independently
  • 12. Aspects of Enterprise Ready Cloud Platform
  • 13. Critical Aspects of Enterprise Cloud Platform ● Manage and monitor multiple clusters ● Secure data via single window ● Authentication & Authorization via single window ● Replicate data across multiple clusters on need basis ● Profile and debug queries across multiple clusters via single window ● Multiple experiences depending on the user (Data Engineering, Streaming, Fast Analytics, Data profiling etc) Classic Clusters (Optional)
  • 14. Manage multiple clusters in central place
  • 15. Ability to have control over the data end to end Provide access & control of data to end-users right from ingestion phase to prediction phase.
  • 16. Big Data Challenges on Cloud Platform
  • 17. Challenges in the dimension of - Storage - Network - Compute - Throttling - Security - Hardware Specs * These are some of dimensions that we would like to cover in today’s talk.
  • 18. Consistency & Latency Issues with ObjectStores ● Eventual Consistency Issues ○ Certain ObjectStores provide eventual consistency (e.g S3) ■ New files may not be visible for listing (until safely propagated internally). ■ Opening deleted file may be possible due to consistency issues ○ S3Guard ■ Uses “DynamoDB” to persist metadata changes. Provides consistent view of S3 objects for processing. ■ Supports DynamoDB on-demand (i.e no need to explicitly set capacity limits). ● Renames can be expensive ○ Rename = “Copy + Delete” in ObjectStores like S3. ○ Need to build stack which reduces rename operations or favours direct write to destination ● OS Page cache is not leveraged as data is read over network
  • 19. Intelligent Caching for Query Performance ● Avoid reading same data from ObjectStores ○ Systems like Hive/LLAP and Impala cache data locally for improving query performance.
  • 20. Reduce Network Latency ● Reduce number of SSL connections to ObjectStores ○ Added lazySeek implementation to reduce connection breakages.
  • 21. AutoScaling ● Determining the right cluster size can be challenging. ● AutoScaling helps in scaling up/down instances depending on workload ○ Concurrency Based AutoScaling ■ Helps in controlling number of parallel queries ○ Query Isolation ■ When queries scan beyond a certain limit, new clusters are automatically spun up.
  • 22. Affinity Policies for better Network Throughput - AutoScaling policies allow you spin up instances across different availability zones - By default cloud providers tend to spread instances across AZ for availability. - Impacts network throughput for nodes with 10Gbps speed - Set affinity policy to have the instances in the same availability zone
  • 23. Spin up Time ● Cluster/Compute spin up time plays a crucial role in adoption and reducing cost. ● Containerized deployments help a lot in reducing spin up time significantly with K8S ○ 10s of seconds as opposed to minutes
  • 24. K8S: Pods can have same hostname/port ● Pods can have same hostname/port after restart ● This causes trouble for processes tracking nodes based on hostname/port ● Added flexibility in the stack to take care of this situation ○ E.g TEZ-4179: [Kubernetes] Extend NodeId in tez to support unique worker identity
  • 25. Throttling ● Cloud services throttle requests ○ Throttling limits vary across cloud vendors ● Critical to monitor throttling metrics ○ Desirable to enable metrics logging in ObjectStore ○ Accuracy limited to per minute in most of the objectstores
  • 26. Throttling System trying to resend data over SSL on receiving 503 (throttling) causing CPU spike
  • 27. Security ● Perimeter Security ● Encrypted data at rest ● Transfer of intermediate data encrypted ● Need to use optimised libs for improving transport security
  • 28. Hardware Specs across Cloud Vendors ● Watch out for hardware specs across cloud vendors. ○ E.g SSD in Azure can have different perf characteristics than AWS ● OS settings have to be tweaked accordingly ○ E.g network, disk settings ● Choose optimal instance for the workload ○ E.g Instances with high density disks may not be needed as data is stored in ObjectStore ○ Too little disk space can hurt intermediate data being written out.
  • 29. Tomorrow ... ● Plenty of challenges to run Big Data workloads on Cloud ○ Great efforts from Open Source community! ● Users need “No vendor lock in” ○ An Open Data layer for multi-cloud (SODA, CSI etc with infinite possibilities) ○ Network standards across clouds (CNI) ○ Data Lineage and governance for user (Apache Atlas) ○ Security and access as open standard (Apache Ranger) ● Users are looking for an Open Data Architecture for multiple clouds which is enterprise ready!
  • 30. Thank You ● References ○ Cloudera Data Platform (Multi Cloud): https://docs.cloudera.com/cdp/latest/index.html ○ Hadoop: Decade two, Day zero: https://blog.cloudera.com/hadoop-decade-two-day-zero/ ● Cloudera careers
  • 31. Q/A

Hinweis der Redaktion

  1. For a true enterprise ready cloud platform We need a way to register, manage and control multiple clusters in a central place Need a way to handle security policies via central place Provide different user experiences depending on the data processing requirements like “Machine Learning”, “Data Warehouse”, “Data Engineering” and so on
  2. Observed this in Azure, where throttling can have adverse impact on CPU utilization. System was sending good amount of data to Azure ObjectStore and got throttled with 503 exceptions. Due to retry logic, system continued to retry and send over the same data over wire. This caused high CPU usage due to encryption
  3. Hardware specs across different cloud vendors could be very different. For instance, SSD in AWS gave around 288 MB/s speed, where as in Azure it gave 89 MB/s. Would recommend to measure performance, before choosing appropriate instances. OS settings need to be tweaked accordingly as well. For e.g we had to recently disable certain disk settings to avoid unwanted kernel calls, as we were on SSD. It would be good choose optimal instance type for the workload