Building and deploying an analytics service on the cloud is a challenge. A bigger challenge is maintaining the service. In a world where users are gravitating towards a model where cluster instances are provisioned on the fly, used for analytics or other purposes, and then shut down when the jobs are done, containers and container orchestration are more relevant than ever.
Container orchestrators like Kubernetes can be used to deploy and distribute modules quickly, easily, and reliably. The intent of this talk is to share the experience of building such a service and deploying it on a Kubernetes cluster. We will discuss the requirements that an enterprise-grade Hadoop/Spark cluster running on containers places on a container orchestrator.
This talk will cover in detail how the Kubernetes orchestrator can meet all our needs for resource management, scheduling, networking and network isolation, volume management, and more. We will discuss how we replaced our home-grown container orchestrator, which managed the container lifecycle and resources according to our requirements, with Kubernetes. We will also present the orchestrator features that are helping us deploy and patch thousands of containers, along with a list of areas we believe need improvement or can be enhanced in a container orchestrator.
Speaker
Rachit Arora, SSE, IBM
Why Kubernetes as a container orchestrator is the right choice for running Spark clusters on the cloud?
1. Why Kubernetes as a container orchestrator is the right choice for running Spark clusters on the cloud?
Rachit Arora
rachitar@in.ibm.com
IBM, India Software Labs
2. Spark
Unified, open-source, parallel data processing framework for Big Data analytics.
[Architecture diagram] On top of the Spark Core Engine sit Spark SQL (interactive queries), Spark Streaming (stream processing), Spark MLlib (machine learning), and GraphX (graph computation). The core engine runs on the Standalone Scheduler, YARN, Mesos, or Kubernetes.
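To make the stack concrete, here is a minimal, hedged PySpark sketch (assuming pyspark is installed; the input file events.json is illustrative) that exercises the core engine and the Spark SQL library. The master URL is where the Standalone/YARN/Mesos/Kubernetes choice plugs in.

```python
from pyspark.sql import SparkSession

# Build a session against the local scheduler; in production the master URL
# would point at standalone, YARN, Mesos, or Kubernetes instead.
spark = (SparkSession.builder
         .appName("spark-stack-demo")
         .master("local[*]")
         .getOrCreate())

# The same unified engine serves SQL, streaming, MLlib, and GraphX workloads.
df = spark.read.json("events.json")   # hypothetical input file
df.createOrReplaceTempView("events")
spark.sql("SELECT count(*) AS n FROM events").show()

spark.stop()
```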
4. Let's look at the role of a Data Scientist
• I want to run my analytics jobs
• Social media analytics
• Text analytics (structured and unstructured)
• I want to run queries on demand
• I want to run R scripts
• I want to submit Spark jobs (see the sketch after this list)
• I want to view History Server Logs of my application
• I want to view daemon logs
• I want to write Notebooks
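Taking the "submit Spark jobs" item as an example, here is a hedged sketch of scripting a submission; the master URL, config, and script name are illustrative assumptions.

```python
import subprocess

# Submit a batch job; enabling the event log is what lets the History Server
# show the application's logs afterwards.
subprocess.run([
    "spark-submit",
    "--master", "yarn",                        # or spark://, mesos://, k8s://
    "--deploy-mode", "cluster",
    "--conf", "spark.eventLog.enabled=true",
    "my_analytics_job.py",                     # hypothetical user script
], check=True)
```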
5. Evolution of Spark Analytics
• On-Prem Install
  • Acquire hardware
  • Prepare machines
  • Install Spark
  • Retry
  • Apply patches
  • Security
  • Upgrades
  • Scale
  • High availability
• Virtualization
  • Prepare a VM imaging solution
  • Network management
  • High availability
  • Patches
  • Scale
• Managed
  • Configure cluster
  • Customize
  • Scale
  • Pay even if idle
• Serverless
  • Run analytics
6. Spark Serverless Characteristics
• No Servers to Provision
• Scale with usage
• Availability and fault tolerant
• Never pay for idle
[Architecture diagram] A Data Scientist (via a browser) and a Data Engineer (via COS/ingestion) send requests to the serverless Spark components: Kernel, History Server, and Notebook Server.
7. Typical Hadoop/Spark Cluster - Setup
• Get the suitable hardware
• Prepare host machine
• Setup various networks
• Private
• Public
• Management
• Fetch the binaries for the install
• Prepare the blueprint/config file for the install (see the sketch after this list)
• Start the install
• The install often fails; debug and retry.
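Since slide 21 later installs IOP from a blueprint, here is a hedged sketch of what such a blueprint/config file might look like, assuming Ambari-style blueprints; the stack, version, and component names are illustrative assumptions, not the talk's actual file.

```python
# Illustrative Ambari-style blueprint expressed as a Python dict.
blueprint = {
    "Blueprints": {
        "blueprint_name": "spark-cluster",
        "stack_name": "BigInsights",   # assumed IOP stack name
        "stack_version": "4.2",
    },
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "SPARK_JOBHISTORYSERVER"}]},
        {"name": "data", "cardinality": "3",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}
```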
8. Earlier Experiments
Option  OS Provisioning                                Config                  Cluster Management / Updates
1       Bare metal                                     Chef                    Chef
2       xCAT – Stateful (create your own VMs)          PostScripts             xCAT – updateNode
3       xCAT – Sysclone (image from current system)    Not needed              xCAT – updateNode
4       Bare metal                                     PostScripts             xCAT – updateNode
5       Cloud-provider-specific images                 Not needed              Manual/scripts
6       Standard ISO image                             Anaconda post-scripts   Manual/scripts
9. How do I build Serverless Spark
• Option 1: Vanilla containers – if I need to build it without Kubernetes
• Repeatable
• Application Portability
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure Utilization
10. Guiding Principles
• Virtualization helps with repeatability, fewer failures & speed
• Maintenance
• Performance (equivalent to bare metal)
• Use open source from an active community
• Cloud-agnostic
13. Docker in Hadoop Cluster on Cloud
• Each cluster node is a virtual node powered by Docker; each node of the cluster is a Docker container
• Docker containers run on a pool of bare-metal hosts (Docker hosts)
• Each Hadoop cluster has multiple nodes/Docker containers spanning multiple hosts
• Docker
  • Container management – custom
  • Multi-host networking – overlay network
  • Registry – private
  • Local storage
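A hedged sketch of the "node = container" idea with the Docker SDK for Python (pip install docker); the private-registry host, image name, and tag are illustrative assumptions.

```python
import docker

client = docker.from_env()

# Pull a node image from the private registry...
client.images.pull("registry.internal.example.com/hadoop/datanode:4.2")

# ...and start it as one virtual cluster node on this Docker host.
node = client.containers.run(
    "registry.internal.example.com/hadoop/datanode:4.2",
    detach=True,
    name="cluster1-data1",
)
print(node.name, node.status)
```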
14. Typical Clusters
[Diagram] Three example clusters (Cluster 1, Cluster 2, Cluster 3), each inside its own network boundary. Each cluster has a Master node and between two and five Data nodes, plus auxiliary containers: LDAP, KMS, MySQL, and SSH.
15. Docker Images
• Master node
• Data Node
• Edge Node
• Auxiliary service images
• LDAP
• MySQL
• Ambari server
• KMS
16. Multi host Docker networking
• Weave-based overlay network among nodes
• One /26 private subnet per cluster (172.x.x.x)
• Master node has a public IP – port forwarding
• Portable public IPs
• Network speed (shared with other masters)
• Edge node is accessible using a public IP
• Users can SSH in and run the Hive, HBase, Hadoop & Spark shells
• Private network
• High Speed
• Secure
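A hedged sketch of the "one /26 per cluster" idea using the Docker SDK's IPAM types; the subnet, network name, and the Weave libnetwork driver name ("weavemesh") are assumptions, and the talk's actual setup used Weave's own tooling.

```python
import docker
from docker.types import IPAMConfig, IPAMPool

client = docker.from_env()

# Carve out one private /26 subnet for this cluster's overlay network.
ipam = IPAMConfig(pool_configs=[IPAMPool(subnet="172.30.1.0/26")])
net = client.networks.create(
    "cluster1-net",
    driver="weavemesh",   # assumed Weave network-plugin driver name
    ipam=ipam,
)
print(net.name, net.id)
```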
17. Network Architecture
[Diagram] Three Docker hosts (Host 1–3) joined by a weave overlay network running on the SoftLayer private network. Each host has bond0/bond1 uplinks to the 10 Gbps SoftLayer public and private networks, a docker0 bridge, and an eth-weave interface; Master, Edge, and Data node containers for clusters A, B, and C are spread across the hosts, with Docker port forwarding for external access.
* docker ICC=false (no inter-container communication over the docker0 network)
* All inter-container communication is through the weave network
* One weave private subnet per cluster (no communication across subnets)
19. Provisioning Infrastructure
• Cluster Manager that provides REST API to create cluster
• API Gateway application
• Deployment agent
• Home-grown container orchestrator – deployer scripts that actually do all the work
  • Prepare directory structure
  • Prepare network
  • Start containers with the right options for
    • Volumes
    • Ports
    • IP
    • Hostname
    • Network
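A hedged sketch of the deployer's "start container with the right options" step via the Docker SDK; every image name, path, port, and address here is an illustrative assumption.

```python
import docker

client = docker.from_env()

# Start the node with a hostname, bind-mounted local storage, and forwarded ports.
master = client.containers.run(
    "registry.internal.example.com/hadoop/master:4.2",   # hypothetical image
    detach=True,
    name="cluster1-master",
    hostname="master.cluster1.local",
    volumes={"/data/cluster1/master": {"bind": "/hadoop/data", "mode": "rw"}},
    ports={"22/tcp": 30122, "8080/tcp": 30180},
)

# Attach it to the per-cluster network with a fixed IP (the "IP" option above).
client.networks.get("cluster1-net").connect(master, ipv4_address="172.30.1.10")
```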
21. How is a cluster created?
[Diagram] The Cluster Manager, its DB, the API Gateway, and the Deployer Agents cooperate:
1. Create Cluster (request arrives via the API Gateway)
2. Manage resources (Cluster Manager records the cluster in the DB)
3. Create Cluster (Cluster Manager hands off to the Deployer Agents)
4. PrepareNode
5. Get Node details
6. Get Blueprint
7. Install IOP
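A hedged sketch of step 1 against the Cluster Manager's REST API; the endpoint URL, port, and payload fields are hypothetical, since the talk does not spell out the API shape.

```python
import requests

# Step 1: ask the Cluster Manager (via the API Gateway) to create a cluster.
resp = requests.post(
    "https://cluster-manager.example.com/v1/clusters",   # hypothetical endpoint
    json={
        "name": "analytics-cluster-1",
        "blueprint": "spark-cluster",   # the blueprint fetched in step 6
        "num_data_nodes": 3,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```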
22. How do I build Serverless Spark
• Option 2: Function as a Service
• Single-node cluster – or no cluster at all
• Spark local mode
• All-in-one image
• Resource Limitations
• Design Limitations
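For the "Spark local mode" point, a minimal sketch: driver and executors share one JVM, so the whole thing fits in a single all-in-one image but inherits the resource limits noted above. The memory setting is an illustrative assumption.

```python
from pyspark.sql import SparkSession

# Local mode: two worker threads inside the driver JVM, no cluster at all.
spark = (SparkSession.builder
         .master("local[2]")
         .config("spark.driver.memory", "1g")   # must fit the function's memory cap
         .appName("faas-spark")
         .getOrCreate())

print(spark.range(1_000_000).selectExpr("sum(id)").first())
spark.stop()
```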
23. How do I build Serverless Spark
• Option 3 : Kubernetes
* Slide from Kubernetes Scheduler Design & Discussion
24. What Kubernetes Bring in?
• Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications
• It manages containers for me
• It manages high availability
• It provides me the flexibility to choose the resources I WANT and the persistence I want
• Kubernetes – lots of add-on services: third-party logging, monitoring, and security tools
• Reduced operational costs
• Improved infrastructure utilization
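A hedged sketch of the "it manages containers for me" point using the official Kubernetes Python client (pip install kubernetes); the Deployment name and namespace are illustrative assumptions.

```python
from kubernetes import client, config

config.load_kube_config()   # or load_incluster_config() inside a pod
apps = client.AppsV1Api()

# Scaling a hypothetical Spark-worker Deployment is a single API call;
# Kubernetes then handles scheduling, restarts, and availability of the pods.
apps.patch_namespaced_deployment_scale(
    name="spark-worker",
    namespace="analytics",
    body={"spec": {"replicas": 5}},
)
```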
26. Conclusion
• Spark serverless – a need for data scientists
• Kubernetes enables
  • Spark clusters in the background, with kernels up and running in seconds
  • High availability
  • Auto-scaling with Spark monitoring and Kubernetes deployment features
  • No extra money for idle time
27. References
• IBM Watson Studio
https://datascience.ibm.com
• IBM Watson
https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/
• Analytics Engine
https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Kubernetes Scheduler
Design & Discussion
• Kubernetes Clusters on IBM Cloud
Rachit Arora
rachitar@in.ibm.com
@rachit1arora
Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications.
Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core: the foundation of Spark, providing libraries for scheduling and basic I/O. Spark offers hundreds of high-level operators that make it easy to build parallel apps.
Spark also includes prebuilt machine-learning algorithms and graph-analysis algorithms written specifically to execute in parallel and in memory. It also supports interactive SQL processing of queries and real-time streaming analytics. As a result, you can write analytics applications in programming languages such as Java, Python, R, and Scala. You can run Spark using its standalone cluster mode, on the cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes, and access data in HDFS, Cassandra, HBase, Hive, Object Store, and any Hadoop data source.
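To ground that list of deployment modes, a hedged sketch of submitting an application to Kubernetes using Spark's native support (Spark 2.3+); the API-server URL, image, and example script path are assumptions.

```python
import subprocess

# Submit to a Kubernetes cluster instead of YARN/Mesos/standalone.
subprocess.run([
    "spark-submit",
    "--master", "k8s://https://kube-apiserver.example.com:6443",  # hypothetical
    "--deploy-mode", "cluster",
    "--conf", "spark.executor.instances=3",
    "--conf", "spark.kubernetes.container.image=example/spark:2.3.0",
    "local:///opt/spark/examples/src/main/python/pi.py",
], check=True)
```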
Prepare: even though you have the right data, it may not be in the right format or structure for analysis. That's where data preparation comes in. Data engineers need to bring raw data into one interface from wherever it lives – on premises, in the cloud, or on your desktop – where it can then be shaped, transformed, explored, and prepared for analysis. Data scientist: primarily responsible for building predictive analytic models and insights. He will analyze data that's been cataloged and prepared by the data engineer, using machine-learning tools like Watson Machine Learning, and build applications using Jupyter Notebooks and RStudio. After the data scientist shares his analytical outputs, an application developer can build apps like a cognitive chatbot. As the chatbot engages with customers, it will continuously improve its knowledge and help uncover new insights.
As a data scientist, what was I required to do?
On-prem to virtualization: as demand for the service increased in my organization, I decided to move to virtualized VMs to handle many requests on demand, but there was still plenty of pain.
Then I decided to try services offered on the cloud, like EMR, IBM Analytics Engine, or Microsoft HDInsight, but there I needed to order clusters and configure them to suit my workloads, and keep them running even when I did not want to use them.
Cover what it takes to install a Hadoop/Spark cluster.
With Spark and Hadoop, serverless is a whole new game: "Function as a Service".
- History, logs, performance – as if I am doing stuff on prem.
Let's take the case of a data scientist.
Notebook – kernel, logs, History Server: the expectations from serverless Spark.
Data engineer and data scientist sending requests to serverless Spark.
Setup is hard.
6.8 to 6.7 example
Optimal settings for workloads.
We do have such an offering; it's deployed in Bluemix.
xCAT is the Extreme Cluster/Cloud Administration Toolkit. xCAT offers complete management for clusters, grids, clouds, datacenters, and many other things. It is agile, extensible, and based on years of system-administration best practices and experience. It enables you to:
Provision operating systems on physical or virtual machines: RHEL, CentOS, Fedora, SLES, Ubuntu, AIX, Windows, VMware, KVM, PowerVM, PowerKVM, zVM.
Provision using scripted install, stateless, statelite, iSCSI, or cloning
What problems we faced:
- Managing hardware as a service – complex
- Building solutions for container orchestration
- Container security and OS patching
PROBLEM: INSTALLATION AND CUSTOMIZATION
- Data persistence
- First we tried an imaging solution, and then when I did the docker run command – that was just magic.
"How do I back up a container?"
How do I handle a restart of the host machine?
"What's my patch-management strategy for my running containers?"
How do I network them?
Where is my data going to be?
How do I manage them?
500 bare-metal hosts
We spin up containers
We are not using an orchestrator as of now; will talk about why
Our own registry
Local storage
Monolithic application being broken down. We plan to take this further, down to services.
Stress the benefits of separating things out, e.g. I can change LDAP to IPA, MySQL to another DB, or the KMS version.
Separate metrics for these containers
Stress on security
Stress on extensibility to adopt newer networking solutions
Mention that it is not recommended to have SSH.
Talk about IP forwarding, the multiple networks we have, and the need for those networks in the cluster.
Explain here the CPU allocation and scheduling strategies: how we wanted to schedule containers and the reasoning behind it, taking advantage of Hadoop's built-in redundancy and replication to lower the chances of failure.
Talk here about orchestration needs, CPU sets, and local disks, and how many of the current orchestrators do not fit. Talk about the monitoring and logging aspects of the containers.
Resource manager , layout maker
Deployer can be any of the machines; there is no master deployer here.
Wait for nodes to be prepared.
Once up, start the install, which is config-driven. Yum install hadoop! Oh, I have it. Set up the DB. Oh, I have it.
AWS Lambda, IBM/Apache OpenWhisk, Microsoft Azure Functions. No cluster, just executors.
Not the experience I am used to. History Server logs, monitoring tools?
Inability to communicate directly: Spark, with its DAG execution framework, spawns jobs with multiple stages, and inter-stage communication requires data transfer across executors. Many Function-as-a-Service platforms do not allow communication between two function invocations. This poses a challenge for running executors in this environment.
Extremely limited runtime resources: many function invocations are currently limited to a maximum execution duration of 5 minutes, 1536 MB of memory, and 512 MB of disk space. Spark loves memory, can have a large disk footprint, and can spawn long-running tasks. This makes functions a difficult environment to run Spark on.
Introduction to Kubernetes
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
It Manages Containers for me
It manages high availability
It provides me the flexibility to choose the resources I want and the persistence I want
Kubernetes – lots of add-on services: third-party logging, monitoring, and security tools
Reduced operational costs
Improved infrastructure utilization
No or very little latency
IBM Watson brings together data management, data policies, data preparation, and analysis capabilities into a common framework.
You can index, discover, control, and share data with Watson Knowledge Catalog, refine and prepare the data with Data Refinery, then organize resources to analyze the same data with Watson Studio.
The IBM Watson apps are fully integrated to use the same user interface and framework. You can pick whichever apps and tools you need for your organization.
Watson Studio provides you with the environment and tools to solve your business problems by collaboratively analyzing data.
What is Analytics Engine?
You can use Analytics Engine to build and deploy clusters within minutes, with a simplified user experience, scalability, and reliability. You can custom-configure the environment and scale on demand.