DataWorks Summit 2018, San Jose, CA
What’s the ‘Hadoop-la’
about Kubernetes
Today’s Speakers
Nanda Vijaydev (@NandaVijaydev)
Sr. Director of Solutions, BlueData Software
Anant Chintamaneni (@AnantCman)
Vice President of Products, BlueData Software
Agenda
• Market Dynamics (with containers)
• What is Kubernetes – Why should you care?
• Requirements for Stateful Hadoop Clusters
• Key gaps in Kubernetes for running Hadoop
• What will it take to get from here to there?
• Q & A
The “Promised Land”
Single “Container” Platform for multiple application patterns
• Workloads: Stateless (web frontends, servers), Stateful (databases, queues), Daemons (log collection, monitoring), Others?
• Infra-agnostic: single “container” platform
• Target infra: Public Cloud Infrastructure, On-Prem Infrastructure
And the winner is……..
Kubernetes (K8s) – Key Points..
| Open source “platform” for containerized workloads
| Platform building blocks vs. turnkey platform
– https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not
| Top use case is stateless/microservices deployments
| Evolving for stateful and others
Kubernetes (K8s) – Key Concepts
| Kubernetes: a platform for application patterns
| Pod: a single instance of an application in Kubernetes
| Controller: manages replicated pods for an application pattern
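To make the Pod and Controller concepts concrete, here is a minimal sketch of a stateless application managed by a Deployment controller. The name `web-frontend`, the labels, and the `nginx:1.25` image are illustrative assumptions, not taken from the talk.

```yaml
# Sketch only: a Deployment (one kind of controller) that keeps
# three identical, interchangeable Pods running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend          # hypothetical name
spec:
  replicas: 3                 # the controller maintains this many Pods
  selector:
    matchLabels:
      app: web-frontend
  template:                   # Pod template: one instance of the application
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: nginx:1.25   # illustrative stateless workload
          ports:
            - containerPort: 80
```

Because every replica is interchangeable, this pattern suits stateless services; it gives Pods no stable identity or per-Pod storage.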
Kubernetes (K8s) – Master/Worker
Kubernetes (K8s) – Pods
Kubernetes (K8s) – Controller
Kubernetes (K8s) – Service
Kubernetes (K8s) – Storage
| Volume: Ephemeral, Lifecycle of pod
| Persistent Volume: Networked Storage, Pod independent
| Persistent Volume Claim: Requested amount
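The three storage terms above can be sketched in one manifest. This is an illustrative example under assumed names: the claim name `namenode-metadata`, the pod name, the sizes, and the mount paths are hypothetical; the image tag is reused from the HadoopCluster example later in the deck.

```yaml
# Sketch only: a Persistent Volume Claim requests an amount of storage;
# the cluster binds it to a Persistent Volume that outlives any pod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: namenode-metadata       # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi             # the "requested amount"
---
apiVersion: v1
kind: Pod
metadata:
  name: namenode-demo           # hypothetical name
spec:
  containers:
    - name: namenode
      image: bluedata/hdp26:v0.0.1
      volumeMounts:
        - name: scratch
          mountPath: /tmp/scratch
        - name: metadata
          mountPath: /hadoop/hdfs/namenode   # assumed metadata directory
  volumes:
    - name: scratch
      emptyDir: {}              # ephemeral volume: deleted with the pod
    - name: metadata
      persistentVolumeClaim:
        claimName: namenode-metadata         # pod-independent storage
```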
Kubernetes (K8s) - Controller Patterns
Reality Check…. K8s challenges
source: https://www.cncf.io/blog/2017/06/28/survey-shows-kubernetes-leading-orchestration-platform/
Why Hadoop/Spark on Containers
Infrastructure
• Agility and elasticity
• Standardized environments (dev, test, prod)
• Portability (on-premises and cloud)
• Higher resource utilization
Applications
• Fool-proof packaging (configs, libraries, driver versions, etc.)
• Repeatable builds and orchestration
• Faster app dev cycles
Not to be confused with…
This is not about using containers to run Hadoop/Spark tasks on YARN:
Source: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications
Hadoop in Docker Containers
This is about running Hadoop clusters in containers.
[Diagram labels: containers, cluster]
Attributes of Hadoop Clusters
• Not exactly monolithic applications, but close
• Multiple, co-operating services with dynamic APIs
– Service start-up / tear-down ordering requirements
– Different sets of services running on different hosts (nodes)
– Tricky service interdependencies impact scalability
• Lots of configuration (aka state)
– Host name, IP address, ports, etc.
– Big meta-data: Hadoop and Spark service-specific configurations
Hadoop itself is clustered…
Legend: RM = YARN ResourceManager, NM = YARN NodeManager, NN = HDFS NameNode, DN = HDFS DataNode
• Master Node: RM, NN
• Worker Node (x3): DN, NM
• HiveServer2: Data, Metadata
RM YARN ResourceManager
NM YARN NodeManager
NN HDFS NameNode
DN HDFS DataNode
JHS Job History Server
JN Journal Node
ZK ZooKeeper
HFS HttpFS Service
HM HBase Master
HRS HBase Region Server
Hue Hue
OZ Oozie
SHS Spark History Server
Ambari Ambari server
DB MySQL/Postgres
GW Gateway
FA Flume Agent
Tez Tez Service
SS Solr Server
LLAP Hive on LLAP
RA Ranger
HS Hive Server
HSS Hive Metastore Service
ACK! There is seemingly no end to these services & versions…
And lots of services to keep in sync
Managing and Configuring Hadoop
• Use a Hadoop manager
– Hortonworks: Ambari
– Cloudera: Cloudera Manager
– MapR: MapR Control System (MCS)
• Follow common deployment pattern
• Ensures distro supportability
And we want multiple Hadoop clusters
“Containerized” Platform: Data Engineering, SQL Analytics, Machine Learning
• Multiple evaluation teams
• Evaluate different business use cases (e.g. ETL, machine learning)
• Use different services (e.g. Hive, Pig, SparkR), different distributions / versions
• Shared ‘containerized’ infrastructure
• Petabyte scale data
Multiple distributions, services, tools on shared, cost-effective infrastructure
[Diagram: clusters at different versions (labels as extracted: 2.6, 2.2, 2.12.5, 2.7) on shared Data/Storage]
Requirements for success
• Hadoop won’t change
• Resource Management (YARN)
• Master Services always running
• Hadoop Service Dependency & Endpoints
• State Persistence (Data + Metadata)
Hadoop Clusters on Kubernetes
Challenges and Gaps
• The existing, available Controller patterns are insufficient
• Hadoop service inter-communication via K8s Services (ClusterIP, NodePort, etc.) is not trivial
• The persistent volume (PV) and persistent volume claim (PVC) approach needs to adapt to Hadoop requirements for state persistence
So is it possible to run Hadoop in all its glory on Kubernetes (K8s)?
It’s a journey
Started with the BlueData Custom Controller on K8s 12 months ago - we learnt a lot!
https://www.bluedata.com/blog/2017/12/big-data-container-orchestration-kubernetes-k8s/
Custom Controller - Architecture
[Diagram: K8s Cluster with networking (e.g. Calico)]
• K8s control plane: API Server, Scheduler, Controller Manager
• Namespaces: BlueData Namespace, Default Namespace
• Custom Controller (Pod)
• HDP Cluster: one Pod running Ambari, NN, RM; three Pods each running DN, NM
• Other Pods in the K8s cluster
Our ‘Custom Controller’ Approach
• Launch StatefulSets for defined roles
• Configure and start services in the right sequence
• Make the services available to end users – network and port mapping
• Secure the services with existing enterprise policies (e.g. LDAP / AD)
• Maintain Big Data performance goals
Launching HDP on K8s with Ambari
Each role is a StatefulSet; this cluster has 4 StatefulSets.
Launch: BlueData UI or API
- Cluster Metadata: Manifest file
- Node Roles: StatefulSets
- Node count: Number of pods per role
- Node Services: List of services and ports
HDP cluster running on K8s with BlueData
- A NodePort service is created per pod, for all endpoints of each pod
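One way to picture a per-pod NodePort Service is the sketch below. It is an assumption-laden illustration, not BlueData's actual manifest: the Service name, the namespace, and the pod name `hdp-master-0` are hypothetical, and the ports are the usual defaults for Ambari, the Hadoop 2.x NameNode UI, and the ResourceManager UI. The selector relies on the `statefulset.kubernetes.io/pod-name` label that the StatefulSet controller adds to each pod (Kubernetes 1.9 and later).

```yaml
# Sketch only: expose every endpoint of a single StatefulSet pod
# outside the cluster through one NodePort Service.
apiVersion: v1
kind: Service
metadata:
  name: hdp-master-0-endpoints     # hypothetical name
  namespace: bluedata              # hypothetical namespace
spec:
  type: NodePort
  selector:
    # Matches exactly one pod of the StatefulSet
    statefulset.kubernetes.io/pod-name: hdp-master-0
  ports:
    - name: ambari
      port: 8080
      targetPort: 8080
    - name: nn-ui                  # HDFS NameNode web UI (Hadoop 2.x default)
      port: 50070
      targetPort: 50070
    - name: rm-ui                  # YARN ResourceManager web UI
      port: 8088
      targetPort: 8088
```

With `nodePort` left unset, Kubernetes picks a free port from the node port range for each entry; `kubectl get service hdp-master-0-endpoints` shows the resulting mapping.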
Statefulset definition - details
• Persistent Storage
– Volume claim template
– Preserve / (root) to enable restarts and migration
– Both the init and app containers mount the same “subPath” from the dynamically provisioned volume
– The initContainer sets up /var, /opt, and /etc on the dynamically provisioned volume, which the app container then uses
• Container access setup
– Leverage the K8s postStart hook to set up authorized_keys and /etc/resolv.conf
• Ease of use
– Added the concept of a flavor definition for CPU, memory, storage, etc.
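The StatefulSet details above could look roughly like the following sketch. It is not the manifest used in the talk: the role name `hdp-worker`, the labels, the sizes, the seeding logic, and the `/opt/setup-access.sh` script are all hypothetical, and it assumes the image ships `sh` and `cp` and that a headless Service named `hdp-worker` exists.

```yaml
# Sketch only: one StatefulSet per Hadoop role, with /var, /opt and /etc
# preserved on a dynamically provisioned volume.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdp-worker                  # hypothetical role name
spec:
  serviceName: hdp-worker           # headless Service assumed to exist
  replicas: 4
  selector:
    matchLabels:
      role: worker
  template:
    metadata:
      labels:
        role: worker
    spec:
      initContainers:
        - name: seed-root-dirs
          image: bluedata/hdp26:v0.0.1
          # On first start only, copy the image's directories onto the volume.
          command:
            - sh
            - -c
            - |
              for d in var opt etc; do
                if [ ! -e "/mnt/$d/.seeded" ]; then
                  cp -a "/$d/." "/mnt/$d/" && touch "/mnt/$d/.seeded"
                fi
              done
          volumeMounts:
            - {name: data, mountPath: /mnt/var, subPath: var}
            - {name: data, mountPath: /mnt/opt, subPath: opt}
            - {name: data, mountPath: /mnt/etc, subPath: etc}
      containers:
        - name: hadoop
          image: bluedata/hdp26:v0.0.1
          resources:                # stands in for a "flavor" definition
            requests: {cpu: "4", memory: 16Gi}
            limits: {cpu: "4", memory: 16Gi}
          lifecycle:
            postStart:
              exec:
                # Hypothetical script: writes authorized_keys and /etc/resolv.conf
                command: ["sh", "-c", "/opt/setup-access.sh"]
          volumeMounts:             # same subPaths as the init container
            - {name: data, mountPath: /var, subPath: var}
            - {name: data, mountPath: /opt, subPath: opt}
            - {name: data, mountPath: /etc, subPath: etc}
  volumeClaimTemplates:             # one PVC per pod, provisioned dynamically
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

Because each pod keeps its own claim (`data-hdp-worker-0`, `data-hdp-worker-1`, and so on), a restarted or rescheduled pod finds its configuration and service state where it left them.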
Key Gaps (Custom Controller)
Functional gaps
• Authentication and authorization were handled by the custom controller
• Limited to a single namespace and lacked mapping to K8s multi-tenancy
Usability gaps
• Inability to use native kubectl commands for all operations
• Unable to use Helm charts and other community projects
So what’s next to make it more K8s-native and address these gaps?
Available Approaches….
• Use kubectl commands for simple deployment
• Use Helm charts for dependency management
• Use Operators for managing complex actions during and after deployment
Operator = Custom Resource Definition (CRD) + Custom Controller
Creating Hadoop “Custom Operator”
[Diagram: K8s control plane with API Server, Scheduler, Controller, etcd]
• Register Hadoop CRD
• Create Hadoop cluster (kubectl create hadoopcluster)
• Custom Hadoop Controller: Observe / Assess / Act
Hadoop Operator:
1. Create StatefulSets
2. Configure services
3. Map ports
4. Scale up / scale down
5. Migrate to ensure FT (fault tolerance)
Custom Operator – CRD
• Native extension to standard K8s APIs
• Uses the same authentication, authorization, and audit logging
• Use kubectl commands to operate on CRD objects (e.g. create hadoopcluster)
• API request object will be stored in “etcd”
Example – CRD Registration and Usage
# v1beta1 was current for K8s 1.9; it was removed in K8s 1.22 in favor of apiextensions.k8s.io/v1
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  # The name must be <plural>.<group>
  name: hadoopclusters.hadoop.example.com
spec:
  group: hadoop.example.com
  version: v1alpha1
  scope: Namespaced
  names:
    plural: hadoopclusters
    singular: hadoopcluster
    kind: HadoopCluster
• Create:
kubectl create -f <CRD>.yaml
• New REST API endpoints:
/apis/hadoop.example.com/v1alpha1/namespaces/*/hadoopclusters/...
Example – New objects using CRD
apiVersion: "hadoop.example.com/v1alpha1"
kind: HadoopCluster
metadata:
  name: my-new-hdp-cluster
spec:
  image: bluedata/hdp26:v0.0.1
  roles:
    - name: master
      replicas: 1
      resources …
    - name: worker
      replicas: 4
      resources …
  …
Create:
kubectl create -f <request>.yaml
Manage:
kubectl get hadoopcluster
Custom Operator – Controller
• Watches instances of objects with the type defined in the “CRD”
– Example: Create HDP cluster with Hive and Oozie
• Runs scripts and services to coordinate activities between different pods for clusters
– Example: Start HDFS, start HiveServer2
• Any modifications and scaling logic can be applied using custom controller watch events
– Example: Expand and shrink cluster
• The same controller handles requests for multiple instances of the custom object
– Example: Create and monitor multiple HDP clusters
Review Hadoop “Custom Operator”
[Diagram repeated: API Server, Scheduler, Controller, etcd; Register Hadoop CRD; Create Hadoop cluster (kubectl create hadoopcluster); Custom Hadoop Controller: Observe / Assess / Act]
Additional Configuration
• Network and access to services
– Lightweight Directory Access Protocol (LDAP) service
– Active Directory (AD) service
– Domain Name System (DNS)
– Kerberos Key Distribution Center (KDC)
– Key Management Service (KMS)
• Networking
– Used Calico for our testing
• Storage
– Persistent external storage (GlusterFS)
• This approach allows us to run on any standard K8s installation (1.9 and higher)
Key Takeaways
• Kubernetes is still best suited for stateless services
• Complex stateful services like Hadoop require significant work
• StatefulSets are a key enabler – necessary, but not sufficient
• New innovations and K8s contributions are needed to run Big Data
BlueData will simplify onboarding of Hadoop products to K8s
Thank You
For more information:
www.bluedata.com
Booth # S5

What's the Hadoop-la about Kubernetes?

  • 1. DataWorks Summit 2018, San Jose, CA What’s the ‘Hadoop-la’ about Kubernetes
  • 2. Today’s Speakers Nanda VijaydevAnant Chintamaneni @NandaVijaydev@AnantCman Vice President of Products BlueData Software Sr. Director of Solutions BlueData Software
  • 3. Agenda • Market Dynamics (with containers) • What is Kubernetes – Why should you care? • Requirements for Stateful Hadoop Clusters • Key gaps in Kubernetes for running Hadoop • What will it take to go from here to there. • Q & A
  • 4. The “Promised Land” Single “Container” Platform for multiple application patterns…. Public Cloud Infrastructure On-Prem Infrastructure Stateless (web frontends, servers) Stateful (databases, queues) Daemons (log collection, monitoring) Others? TargetInfraInfra-agnosticWorkloads
  • 5. And the winner is……..
  • 6. Kubernetes (K8s) – Key Points.. | Open source “platform” for containerized workloads | Platform building blocks vs. turnkey platform – https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not | Top use case is stateless/microservices deployments | Evolving for stateful and others
  • 7. Kubernetes (K8s) – Key Concepts | Kubernetes: a platform for application patterns | Pod: a single instance of an application in Kubernetes | Controller: manages replicated pods for an application pattern
  • 8. Kubernetes (K8s) – Master/Worker
  • 10. Kubernetes (K8s) – Controller
  • 12. Kubernetes (K8s) – Storage | Volume: Ephemeral, Lifecycle of pod | Persistent Volume: Networked Storage, Pod independent | Persistent Volume Claim: Requested amount
  • 13. Kubernetes (K8s) - Controller Patterns
  • 14. Reality Check…. K8s challenges source: https://www.cncf.io/blog/2017/06/28/survey-shows-kubernetes-leading-orchestration-platform/
  • 15. Why Hadoop/Spark on Containers Infrastructure • Agility and elasticity • Standardized environments (dev, test, prod) • Portability (on-premises and cloud) • Higher resource utilization Applications • Fool-proof packaging (configs, libraries, driver versions, etc.) • Repeatable builds and orchestration • Faster app dev cycles
  • 16. Not to be confused with…….. This is not about using containers to run Hadoop/Spark tasks on YARN. Source: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications
  • 17. This is about running Hadoop clusters in containers. [Figure: Hadoop cluster in Docker containers]
  • 18. Attributes of Hadoop Clusters • Not exactly monolithic applications, but close • Multiple, co-operating services with dynamic APIs – Service start-up / tear-down ordering requirements – Different sets of services running on different hosts (nodes) – Tricky service interdependencies impact scalability • Lots of configuration (aka state) – Host name, IP address, ports, etc. – Big meta-data: Hadoop and Spark service-specific configurations
  • 19. Hadoop itself is clustered…. Legend: RM = YARN ResourceManager, NM = YARN NodeManager, NN = HDFS NameNode, DN = HDFS DataNode. Master Node: RM, NN. Worker Nodes (three shown): DN, NM each. Also shown: Hive Server 2, Data, Metadata
  • 20. And lots of services to keep in sync. Legend: RM = YARN ResourceManager, NM = YARN NodeManager, NN = HDFS NameNode, DN = HDFS DataNode, JHS = Job History Server, JN = Journal Node, ZK = ZooKeeper, HFS = HttpFS Service, HM = HBase Master, HRS = HBase Region Server, Hue, OZ = Oozie, SHS = Spark History Server, Ambari = Ambari server, DB = MySQL/Postgres, GW = Gateway, FA = Flume Agent, Tez = Tez Service, SS = Solr Server, Hive on LLAP, RA = Ranger, HS = Hive Server, HSS = Hive Metastore Service. ACK! There is seemingly no end to these services & versions …
  • 21. Managing and Configuring Hadoop • Use a Hadoop manager – Hortonworks: Ambari – Cloudera: Cloudera Manager – MapR: MapR Control System (MCS) • Follow common deployment pattern • Ensures distro supportability
  • 22. And we want multiple Hadoop clusters: Data Engineering, SQL Analytics, Machine Learning; “Containerized” Platform. Multiple evaluation teams • Evaluate different business use cases (e.g. ETL, machine learning) • Use different services (e.g. Hive, Pig, SparkR), different distributions / versions • Shared ‘containerized’ infrastructure • Petabyte scale data. Multiple distributions, services, tools on shared, cost-effective infrastructure. Figure labels: versions 2.6, 2.2, 2.1, 2.5, 2.7; Data/Storage
  • 24. Hadoop Clusters on Kubernetes Challenges and Gaps • Existing, available Controller pattern is insufficient • Hadoop service inter-communication via K8s Services (clusterIP, NodePort etc) is not trivial • Persistent volumes (PV) and the persistent volume claim (PVC) approach needs to adapt to Hadoop requirements for state persistence.
  • 25. So is it possible to run Hadoop in all its glory on Kubernetes (K8s)?
  • 27. Started with BlueData Custom Controller on K8s 12 months ago - we learnt a lot! https://www.bluedata.com/blog/2017/12/big-data-container-orchestration-kubernetes-k8s/
  • 28. HDP Cluster Custom Controller - Architecture. Components shown: K8s API Server; K8s Scheduler; K8s Controller Manager; Custom Controller (Pod); HDP Cluster: Ambari, NN, RM (Pod) and three DN, NM (Pod); four other Pods; K8s Cluster; BlueData Namespace; Default Namespace; Networking (e.g. Calico)
  • 29. Our ‘Custom Controller’ Approach.. • Launch StatefulSets for defined roles • Configure and start services in the right sequence • Make the services available to end users – Network and port mapping • Secure the services with existing enterprise policies (e.g. LDAP / AD) • Maintain Big Data performance goals
  • 30. Launching HDP on K8s with Ambari. Each role is a StatefulSet; 4 StatefulSets for this cluster. Launch: BlueData UI or API - Cluster Metadata: Manifest file - Node Roles: StatefulSets - Node count: Number of pods per role - Node Services: List of services and ports
  • 31. HDP cluster running on K8s with BlueData - A nodeport service is created per pod for all endpoints of each pod
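One way to picture the per-pod NodePort mapping is the minimal Python sketch below. It builds one Service manifest per StatefulSet pod and selects each pod through the `statefulset.kubernetes.io/pod-name` label that Kubernetes sets on StatefulSet pods. The StatefulSet name, replica count, port names, and port numbers are hypothetical; the slides do not show BlueData's actual manifests.

```python
import json


def nodeport_services(statefulset, replicas, ports):
    """Build one NodePort Service manifest per pod of a StatefulSet.

    StatefulSet pods are named <statefulset>-<ordinal>, and Kubernetes labels
    each one with statefulset.kubernetes.io/pod-name, so a selector on that
    label targets exactly one pod.
    """
    services = []
    for ordinal in range(replicas):
        pod_name = f"{statefulset}-{ordinal}"
        services.append({
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {"name": f"{pod_name}-np"},
            "spec": {
                "type": "NodePort",
                "selector": {"statefulset.kubernetes.io/pod-name": pod_name},
                # nodePort is omitted so the cluster assigns one from its range.
                "ports": [
                    {"name": name, "port": port, "targetPort": port}
                    for name, port in ports.items()
                ],
            },
        })
    return services


if __name__ == "__main__":
    # Hypothetical worker role: two pods, two illustrative web UI ports.
    example_ports = {"dn-http": 50075, "nm-http": 8042}
    for svc in nodeport_services("hdp-worker", 2, example_ports):
        print(json.dumps(svc, indent=2))
```

Generating the Services from the pod naming rule keeps the mapping deterministic: scaling a role from N to N+1 pods only requires one additional Service.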
  • 32. StatefulSet definition - details • Persistent Storage • Volume claim template • Preserve / (root) to enable restarts and migration • Both the init and app containers are defined to mount the same “subPath” from the dynamic volume • The initContainer sets up /var, /opt and /etc on the dynamically provisioned volume, which the app container then uses • Container access setup • Leverage the K8s postStart hook to set up authorized_keys & /etc/resolv.conf • Ease of use • Added the concept of a flavor definition for CPU, memory, storage etc.
  • 33. Key Gaps (Custom Controller) Functional gaps • Authentication and authorization were handled by the controller • Limited to a single namespace and lacked mapping to K8s multi-tenancy Usability gaps • Inability to use native kubectl commands for all operations • Unable to use Helm charts and other community projects
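The StatefulSet details above can be sketched as a manifest built in Python. This is an illustration under stated assumptions, not BlueData's definition: the volume name `data`, the `/mnt` staging paths, the `cp` seeding command, the storage size, and the `setup-access.sh` script are all hypothetical. It shows a volume claim template, an init container and an app container mounting the same subPaths, and a postStart hook.

```python
import json


def statefulset_manifest(role, replicas, image, storage="100Gi"):
    """Build a StatefulSet dict whose init and app containers share one volume."""
    # Directories preserved on the persistent volume, one subPath each.
    dirs = ["var", "opt", "etc"]
    app_mounts = [
        {"name": "data", "mountPath": f"/{d}", "subPath": d} for d in dirs
    ]
    # The init container sees the same subPaths under /mnt so it can seed
    # them from the image before the app container starts.
    init_mounts = [
        {"name": "data", "mountPath": f"/mnt/{d}", "subPath": d} for d in dirs
    ]
    # Copy image contents without overwriting files kept from earlier runs.
    seed_cmd = " && ".join(f"cp -an /{d}/. /mnt/{d}/" for d in dirs)
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": role},
        "spec": {
            # Assumes a headless Service with this name exists.
            "serviceName": role,
            "replicas": replicas,
            "selector": {"matchLabels": {"role": role}},
            "template": {
                "metadata": {"labels": {"role": role}},
                "spec": {
                    "initContainers": [{
                        "name": "seed-volume",
                        "image": image,
                        "command": ["/bin/sh", "-c", seed_cmd],
                        "volumeMounts": init_mounts,
                    }],
                    "containers": [{
                        "name": "app",
                        "image": image,
                        "volumeMounts": app_mounts,
                        "lifecycle": {"postStart": {"exec": {
                            # Hypothetical script that would write
                            # authorized_keys and /etc/resolv.conf.
                            "command": ["/bin/sh", "-c",
                                        "/usr/local/bin/setup-access.sh"],
                        }}},
                    }],
                },
            },
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(statefulset_manifest("worker", 4, "bluedata/hdp26:v0.0.1"), indent=2))
```

Because each pod gets its own claim from the template, a restarted or rescheduled pod reattaches to the same /var, /opt and /etc contents.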
  • 34. So what’s next to make it more K8s native and address gaps..
  • 35. Available Approaches…. • Use kubectl commands for simple deployment • Use Helm charts for dependency management • Use Operators for managing complex actions during and after deployment Operator = Custom Resource Definition (CRD) + Custom Controller
  • 36. Creating Hadoop “Custom Operator”. Components: API Server, Scheduler, Controller, etcd. Flow: Register Hadoop CRD; Create Hadoop cluster (kubectl create hadoopcluster); Custom Hadoop Controller: Observe / Assess / Act. Hadoop Operator: 1. Create StatefulSets 2. Configure services 3. Map ports 4. Scale up/Scale down 5. Migrate to ensure fault tolerance (FT)
  • 37. Custom Operator – CRD • Native extension to standard K8s APIs • Uses same authentication, authorization, and audit logging • Use kubectl commands to operate on CRD object (e.g. create hadoopcluster) • API request object will be stored in “etcd”
  • 38. Example – CRD Registration and Usage
      apiVersion: apiextensions.k8s.io/v1beta1
      kind: CustomResourceDefinition
      metadata:
        name: hadoopclusters.hadoop.example.com
      spec:
        group: hadoop.example.com
        version: v1alpha1
        scope: Namespaced
        names:
          plural: hadoopclusters
          singular: hadoopcluster
          kind: HadoopCluster
    • Create: kubectl create -f <CRD>.yaml
    • New REST API endpoints: /apis/hadoop.example.com/v1alpha1/namespaces/*/hadoopclusters/...
  • 39. Example – New objects using CRD
      apiVersion: "hadoop.example.com/v1alpha1"
      kind: HadoopCluster
      metadata:
        name: my-new-hdp-cluster
      spec:
        image: bluedata/hdp26:v0.0.1
        roles:
        - name: master
          replicas: 1
          resources …
        - name: worker
          replicas: 4
          resources …
        …
    Create: kubectl create -f <request>.yaml
    Manage: kubectl get hadoopcluster
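Two naming rules tie a CRD to its REST endpoints: Kubernetes requires `metadata.name` to be `<plural>.<group>`, and namespaced custom resources are served under `/apis/<group>/<version>/namespaces/<namespace>/<plural>`. The short Python sketch below derives both from the example values on the slide; the helper names and the `default` namespace are hypothetical.

```python
def crd_name(plural, group):
    """Return the metadata.name Kubernetes requires for a CRD."""
    return f"{plural}.{group}"


def resource_path(group, version, plural, namespace=None, name=None):
    """Return the REST path for a custom resource collection or object.

    With namespace=None the path addresses the collection across all
    namespaces; with a name it addresses a single object.
    """
    path = f"/apis/{group}/{version}"
    if namespace is not None:
        path += f"/namespaces/{namespace}"
    path += f"/{plural}"
    if name is not None:
        path += f"/{name}"
    return path


if __name__ == "__main__":
    group, version, plural = "hadoop.example.com", "v1alpha1", "hadoopclusters"
    print(crd_name(plural, group))
    print(resource_path(group, version, plural, "default", "my-new-hdp-cluster"))
```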
  • 40. Custom Operator – Controller • Watches instances of objects with the type defined in the “CRD” • Example: Create HDP cluster with Hive and Oozie • Runs scripts and services to coordinate activities between different pods for clusters • Example: Start HDFS, Start HiveServer2 • Any modifications and scaling logic can be applied using custom controller watch events • Example: Expand and shrink cluster • Same controller handles requests for multiple instances of custom object • Example: Create and monitor multiple HDP clusters
  • 41. Review Hadoop “Custom Operator”. Components: API Server, Scheduler, Controller, etcd. Flow: Register Hadoop CRD; Create Hadoop cluster (kubectl create hadoopcluster); Custom Hadoop Controller: Observe / Assess / Act
  • 42. Additional Configuration • Lightweight Directory Access Protocol (LDAP) service • Active Directory (AD) service • Domain Name System (DNS) • Kerberos Key Distribution Center (KDC) • Key Management Service (KMS)
  • 43. Network and access to services • Networking – Used Calico for our testing • Storage – Persistent external storage (Gluster) • This approach allows us to run on any standard K8s installation (1.9 and higher)
  • 44. Key Takeaways • Kubernetes is still best suited for stateless services • Complex stateful services like Hadoop require significant work • StatefulSets are a key enabler – necessary, but not sufficient • New innovations and K8s contributions are needed to run Big Data. BlueData will simplify onboarding of Hadoop products to K8s
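The observe / assess / act behaviour of a custom controller can be illustrated with a small, cluster-free Python sketch. It is not the BlueData controller: `Role`, `ClusterState`, and the action tuples are hypothetical stand-ins, and a real controller would read and write this state through watch events on the Kubernetes API rather than an in-memory dict.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """Desired state for one role of a HadoopCluster object."""
    name: str
    replicas: int


@dataclass
class ClusterState:
    """Observed state: current pod count per role (stand-in for API queries)."""
    pods: dict = field(default_factory=dict)


def reconcile(desired_roles, observed):
    """Assess: compare desired roles with observed pods, return actions."""
    actions = []
    desired = {r.name: r.replicas for r in desired_roles}
    for name, want in desired.items():
        have = observed.pods.get(name)
        if have is None:
            actions.append(("create_statefulset", name, want))
        elif have != want:
            actions.append(("scale", name, want))
    for name in observed.pods:
        if name not in desired:
            actions.append(("delete_statefulset", name, 0))
    return actions


def apply_actions(actions, observed):
    """Act: apply each action to the (simulated) cluster state."""
    for verb, name, replicas in actions:
        if verb == "delete_statefulset":
            observed.pods.pop(name, None)
        else:
            observed.pods[name] = replicas


if __name__ == "__main__":
    state = ClusterState()
    roles = [Role("master", 1), Role("worker", 4)]
    apply_actions(reconcile(roles, state), state)  # initial creation
    roles[1].replicas = 6                          # user edits the object
    print(reconcile(roles, state))                 # expand the worker role
```

Keeping `reconcile` free of side effects means the same function can serve every HadoopCluster instance the controller watches, and a second pass over an already converged cluster yields no actions.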
  • 46. Thank You For more information: www.bluedata.com Booth # S5