Presentation of my talk given at the Phoenix Data Conference 2019. In it we look at challenges with the current Apache Hadoop ecosystem.
Apache Hadoop is still relevant, but the way of doing Hadoop and enterprise data architecture has to be rethought as we enter the Cognitive and Cloud Native era.
We need
An architecture enabled by a common runtime layer across on-premise and cloud
An architecture that can abstract away dependency and version conflicts with the tons of open source machine learning libraries out there. YARN did not scale in that respect unless one wanted to deal with multiple conda environments
An architecture that can enable real Hybrid Cloud and Multi Cloud portability
And many more challenges one has to overcome to keep the architecture simple and the infrastructure agile and well utilized
Future of Data Platform in Cloud Native world
1. FUTURE OF DATA PLATFORM IN CLOUD
NATIVE ERA
- Srivatsan Srinivasan
2. WHO AM I?
Chief Data Scientist at Cognizant
https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
https://www.youtube.com/channel/UCwBs8TLOogwyGd0GxHCp-Dw
AIEngineering
3. Cloud Native Data Application
Edge AI/Analytics
Hybrid Cloud
Prescriptive Analytics (From what to why)
Augmented Analytics
6. Is it really the End of the Hadoop Era?
• It did not live up to organizations' performance needs
• It was not able to replace existing EDW infrastructure
• It is too hard to maintain, and even harder to make cloud ready
• Cloud killed Hadoop
7. Is it really the End of the Hadoop Era?
• People failed Hadoop. It is people who did not know which use cases best fitted Hadoop
• People were trying to solve a technology problem rather than a business problem
• Hadoop architecture needs a refresh in today's world
• The underlying assumptions on which Hadoop was created a decade back have not been relevant for years
• There is a better way of doing Hadoop on premise
9. CHALLENGE 1 – Separate Data and Application Infrastructure
Data Infrastructure Application Infrastructure
10. CHALLENGE 1 – Separate Data and Application Infrastructure
Separate infrastructure management
Separate DevOps/DataOps
Inefficient use of infrastructure and specialized hardware accelerators
Applications have to be rewritten when moving from one environment to another
13. CHALLENGE 2 – Difficult Dependency and Version Management
Data scientists need access to the latest and greatest versions
Interdependencies between multiple versions
YARN does not provide a way to isolate dependencies easily
Package dependencies during spark-submit
Create a different conda environment per project
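The "conda environment per project" approach can be shipped with the job itself, so executors do not depend on whatever Python the YARN nodes happen to have. A minimal sketch, assuming the environment was packed beforehand (e.g. with `conda pack -n project_env -o project_env.tar.gz`); the file names and script name are illustrative:

```python
# Sketch: isolating a per-project conda environment for a Spark-on-YARN job.
# Assumes a packed env archive exists (e.g. built with conda-pack); names are placeholders.

def build_spark_submit(app_py: str, env_archive: str = "project_env.tar.gz") -> list[str]:
    """Build a spark-submit command that ships its own Python environment."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Ship the packed env; Spark unpacks it under the alias "environment"
        "--archives", f"{env_archive}#environment",
        # Point the driver (app master) and executors at the shipped interpreter
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python",
        "--conf", "spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python",
        app_py,
    ]

cmd = build_spark_submit("train_model.py")
print(" ".join(cmd))
```

This keeps dependency isolation per job rather than per cluster, which is exactly the gap YARN leaves open.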
14. CHALLENGE 3 – Portability to Hybrid Infrastructure
On Premise
Application Application
Public Cloud
Pattern 1 – Build On premise and Deploy on Cloud
On Premise
(Primary)
Application Application
Public Cloud
(DR)
Pattern 2 – Primary On premise and DR on Cloud
Failover
Pattern 3 – Cloud Bursting
On Premise
(Primary Infra)
Application Application
Public Cloud
(Extended Infra)
Burst on
demand
On Premise
(Sensitive Data)
Application Application
Public Cloud
(Non sensitive data)
Pattern 4 – Placement based on Data Sensitivity and Data Gravity
15. CHALLENGE 4 – Reproducibility from development to production
16. CHALLENGES – Others
Spark version upgrades impact all tenants
Difficult to define deployment strategies like Champion/Challenger
deployment
Data locality forces storage and compute to scale linearly together
All data has to be together
18. What Happened?
Moore's law on bandwidth happened, making data locality not so important
Containers and Kubernetes happened, making YARN exclusive to a few data
applications
Cloud storage happened, making Hadoop storage not so cheap (with caveats
though..)
Apache Hadoop and supporting distributed systems were built in a world
where the underlying assumptions were different from what they are today
19. What do we really need?
A common runtime layer across your private and public cloud
Abstract away dependency and version conflicts
Efficient usage of existing infrastructure
Consistent tooling and CI/CD processes across environments to increase
efficiency
Avoid vendor lock-in for portability
Handle bursty workloads
Less time to provision new environments, and agility to test the latest offerings
22. Operator Support for Data Application
Spark Operator
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Kafka Operator
https://www.confluent.io/confluent-operator/
https://github.com/strimzi/strimzi-kafka-operator
Flink Operator
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
Airflow Operator
https://github.com/GoogleCloudPlatform/airflow-operator
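With an operator installed, a data application becomes a declarative manifest rather than a hand-run submission. A minimal sketch of a `SparkApplication` resource for the spark-on-k8s-operator linked above, built as a Python dict (image, file path, version and resource sizes are placeholders; check the operator's own examples for current field names):

```python
# Sketch: a minimal SparkApplication custom resource for the
# GoogleCloudPlatform spark-on-k8s-operator. All values are illustrative.

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",  # the operator's CRD group/version
    "kind": "SparkApplication",
    "metadata": {"name": "pyspark-pi", "namespace": "default"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "gcr.io/spark-operator/spark-py:v3.0.0",  # placeholder image
        "mainApplicationFile": "local:///opt/spark/examples/src/main/python/pi.py",
        "sparkVersion": "3.0.0",
        "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark"},
        "executor": {"instances": 2, "cores": 1, "memory": "512m"},
    },
}
```

Serialized to YAML and applied with `kubectl apply -f`, this is all the operator needs; the controller handles the actual spark-submit.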
23. Step 1: Decouple compute and storage
Storage (S3, HDFS, GPFS, MapR-FS) ↔ Compute (Spark)
• Compute is not bound to storage; at the same time, existing enterprise data storage can be used where it exists
• Assumes network throughput is high
• Adds 2 to 6% latency depending on the use case
24. Step 1: Decouple compute and storage
Storage (S3, HDFS, GPFS, MapR-FS) ↔ Compute (Spark)
Compute nodes can be sized to compute needs, and storage can scale independently
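Decoupling in practice mostly means pointing Spark at object storage via the Hadoop-AWS `s3a://` connector instead of co-located HDFS. A sketch of the conf entries involved, assuming the default AWS endpoint; verify the keys against your Hadoop version, and note that credentials should come from instance roles or the environment, not hard-coded values:

```python
# Sketch: Spark conf entries for reading/writing object storage over s3a,
# so compute clusters can be resized or torn down independently of the data.

def s3a_conf(endpoint: str = "s3.amazonaws.com") -> dict[str, str]:
    """Conf entries for the Hadoop-AWS s3a connector (keys are standard,
    but check them against your Hadoop version)."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        # Pull credentials from instance roles / env vars rather than hard-coding
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
    }

# Usage (assumes pyspark is installed; bucket name is a placeholder):
# builder = SparkSession.builder.appName("decoupled")
# for k, v in s3a_conf().items():
#     builder = builder.config(k, v)
# df = builder.getOrCreate().read.parquet("s3a://my-bucket/events/")
```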
29. Kubernetes Operator
Automates deployment of applications
An Operator is a method of packaging, deploying and managing instances of complex stateful applications
It builds upon basic Kubernetes resource and controller concepts, but includes domain- or application-specific knowledge to automate common tasks
33. Spark Operator
The Spark Operator controller watches for create/delete/update events on SparkApplication resources
The submission runner runs spark-submit for submissions received from the controller
34. Spark Operator
The Spark Pod Monitor reports pod status updates to the controller
The Mutating Admission Webhook handles customization of Spark driver and executor pods