Presentation of my talk given at the Phoenix Data Conference 2019. In it we look at challenges with the current Apache Hadoop ecosystem.
Apache Hadoop is still relevant, but the way of doing Hadoop and enterprise data architecture has to be rethought as we enter the Cognitive and Cloud Native era.
We need
An architecture enabled by a common runtime layer across on-premise and cloud
An architecture that can abstract away dependency and version conflicts with the tons of open source machine learning libraries out there. YARN did not scale in that respect unless one wanted to deal with multiple conda environments
An architecture that can enable real Hybrid Cloud and Multi Cloud portability
And many more challenges one has to overcome to keep the architecture simple and the infrastructure agile and well utilized
Future of Data Platform in Cloud Native world
1. FUTURE OF DATA PLATFORM IN CLOUD
NATIVE ERA
- Srivatsan Srinivasan
2. WHO AM I?
Chief Data Scientist at Cognizant
https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
https://www.youtube.com/channel/UCwBs8TLOogwyGd0GxHCp-Dw
AIEngineering
3. Cloud Native Data Application
Edge AI/Analytics
Hybrid Cloud
Prescriptive Analytics (From what to why)
Augmented Analytics
6. Is it really the End of the Hadoop Era?
• It did not live up to organizations' performance needs
• It was not able to replace existing EDW infrastructure
• It is too hard to maintain, and even harder to make cloud ready
• Cloud killed Hadoop
7. Is it really the End of the Hadoop Era?
• People failed Hadoop. It is people who did not know which use cases best fitted Hadoop
• People were trying to solve a technology problem rather than a business problem
• Hadoop architecture needs a refresh in today's world
• The underlying assumptions on which Hadoop was created a decade back have not been relevant for years
• There is a better way of doing Hadoop on premise
9. CHALLENGE 1 – Separate Data and Application Infrastructure
Data Infrastructure Application Infrastructure
10. CHALLENGE 1 – Separate Data and Application Infrastructure
Separate infrastructure management
Separate DevOps/DataOps
Inefficient use of infrastructure and specialized hardware accelerators
Applications have to be rewritten when moving from one environment to another
13. CHALLENGE 2 – Difficult Dependency and Version Management
Data scientists need access to the latest and greatest versions
Interdependencies between multiple versions
YARN does not provide a way to isolate dependencies easily
Package dependencies during spark-submit
Create a different conda environment per project
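The "conda environment per project" approach can be shipped with the job itself, so executors do not depend on whatever Python the YARN nodes happen to have. A minimal sketch, assuming the environment was packed beforehand (e.g. with `conda pack -n project_env -o project_env.tar.gz`); the file names and script name are illustrative:

```python
# Sketch: isolating a per-project conda environment for a Spark-on-YARN job.
# Assumes a packed env archive exists (e.g. built with conda-pack); names are placeholders.

def build_spark_submit(app_py: str, env_archive: str = "project_env.tar.gz") -> list[str]:
    """Build a spark-submit command that ships its own Python environment."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Ship the packed env; Spark unpacks it under the alias "environment"
        "--archives", f"{env_archive}#environment",
        # Point the driver (app master) and executors at the shipped interpreter
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python",
        "--conf", "spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python",
        app_py,
    ]

cmd = build_spark_submit("train_model.py")
print(" ".join(cmd))
```

This keeps dependency isolation per job rather than per cluster, which is exactly the gap YARN leaves open.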
14. CHALLENGE 3 – Portability to Hybrid Infrastructure
On Premise
Application Application
Public Cloud
Pattern 1 – Build On premise and Deploy on Cloud
On Premise
(Primary)
Application Application
Public Cloud
(DR)
Pattern 2 – Primary On premise and DR on Cloud
Failover
Pattern 3 – Cloud Bursting
On Premise
(Primary Infra)
Application Application
Public Cloud
(Extended Infra)
Burst on
demand
On Premise
(Sensitive Data)
Application Application
Public Cloud
(Non sensitive data)
Pattern 4 – Placement based on Data Sensitivity and Data Gravity
15. CHALLENGE 4 – Reproducibility from development to production
16. CHALLENGES – Others
Spark version upgrades impact all tenants
Difficult to define deployment strategies like Champion/Challenger
deployment
Data locality forces storage and compute to scale linearly together
All data has to be together
18. What Happened?
Moore's law on bandwidth happened, making data locality not so important
Containers and Kubernetes happened, making YARN exclusive to a few data
applications
Cloud storage happened, making Hadoop storage not so cheap (with caveats
though..)
Apache Hadoop and supporting distributed systems were built in a world
where the underlying assumptions were different from what they are today
19. What do we really need?
A common runtime layer across your private and public cloud
Abstract away dependency and version conflicts
Efficient usage of existing infrastructure
Consistent tooling and CI/CD processes across environments to increase
efficiency
Avoid vendor lock-in for portability
Handle bursty workloads
Less time to provision new environments, and agility to test the latest offerings
22. Operator Support for Data Application
Spark Operator
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Kafka Operator
https://www.confluent.io/confluent-operator/
https://github.com/strimzi/strimzi-kafka-operator
Flink Operator
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
Airflow Operator
https://github.com/GoogleCloudPlatform/airflow-operator
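With an operator installed, a data application becomes a declarative manifest rather than a hand-run submission. A minimal sketch of a `SparkApplication` resource for the spark-on-k8s-operator linked above, built as a Python dict (image, file path, version and resource sizes are placeholders; check the operator's own examples for current field names):

```python
# Sketch: a minimal SparkApplication custom resource for the
# GoogleCloudPlatform spark-on-k8s-operator. All values are illustrative.

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",  # the operator's CRD group/version
    "kind": "SparkApplication",
    "metadata": {"name": "pyspark-pi", "namespace": "default"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "gcr.io/spark-operator/spark-py:v3.0.0",  # placeholder image
        "mainApplicationFile": "local:///opt/spark/examples/src/main/python/pi.py",
        "sparkVersion": "3.0.0",
        "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark"},
        "executor": {"instances": 2, "cores": 1, "memory": "512m"},
    },
}
```

Serialized to YAML and applied with `kubectl apply -f`, this is all the operator needs; the controller handles the actual spark-submit.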
23. Step 1: Decouple compute and storage
Storage (S3, HDFS, GPFS, MapR-FS) ↔ Compute (Spark)
• Compute is not bound to storage; at the same time, existing enterprise data storage can be used where it exists
• Assumes network throughput is high
• Adds 2 to 6% latency depending on the use case
24. Step 1: Decouple compute and storage
Storage (S3, HDFS, GPFS, MapR-FS) ↔ Compute (Spark)
Compute nodes can be sized to compute needs, and storage can scale independently
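Decoupling in practice mostly means pointing Spark at object storage via the Hadoop-AWS `s3a://` connector instead of co-located HDFS. A sketch of the conf entries involved, assuming the default AWS endpoint; verify the keys against your Hadoop version, and note that credentials should come from instance roles or the environment, not hard-coded values:

```python
# Sketch: Spark conf entries for reading/writing object storage over s3a,
# so compute clusters can be resized or torn down independently of the data.

def s3a_conf(endpoint: str = "s3.amazonaws.com") -> dict[str, str]:
    """Conf entries for the Hadoop-AWS s3a connector (keys are standard,
    but check them against your Hadoop version)."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        # Pull credentials from instance roles / env vars rather than hard-coding
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
    }

# Usage (assumes pyspark is installed; bucket name is a placeholder):
# builder = SparkSession.builder.appName("decoupled")
# for k, v in s3a_conf().items():
#     builder = builder.config(k, v)
# df = builder.getOrCreate().read.parquet("s3a://my-bucket/events/")
```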
29. Kubernetes Operator
Automates deployment of applications
An Operator is a method of packaging, deploying and managing instances of complex stateful applications
It builds upon basic Kubernetes resource and controller concepts, but includes domain- or application-specific knowledge to automate common tasks
33. Spark Operator
The Spark Operator controller watches for create/delete/update events on SparkApplication resources
The submission runner runs spark-submit for submissions received from the controller
34. Spark Operator
The Spark Pod Monitor reports pod status updates to the controller
The Mutating Admission Webhook handles customization of Spark driver and executor pods