2. ● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consults in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
3. Agenda
● Deploying Big Data Products on Scale
● Microservices and Containers
● Introduction to Kubernetes
● Kubernetes Abstractions
● Spark 2.0 Docker Images
● Building Spark Cluster
● Scaling Spark Cluster
● Multiple Clusters
● Resource Isolation
4. Problem Statement
Need for a unified deployment platform to
deploy big data based products on
cloud and on-prem, with support for non
big data tools, at scale.
5. A brief about Tellius Product
● Advanced analytics product with support for ETL, data
exploration, visualization and advanced machine
learning
● Uses MongoDB, Akka, MemSQL, Node.js and Angular
apart from Spark
● Supported both on cloud and on-prem
● Scales from a few GBs of data to TBs
6. Challenges of deploying our product
● Should support both big data and non big data based
deployments
● Multiple frameworks need clustering support for
horizontal scaling, e.g. Spark, MemSQL, Akka etc
● Should support different cloud platforms: AWS, Azure
etc
● Should support on-prem deployments also
● Ability to scale on demand
7. Challenges of Resource Sharing
● As multiple parts of the application need horizontal
scaling, choosing the right machines becomes a challenge
● We need to define the clustering parameters in terms of
machines rather than resource usage
● Should we deploy Spark and MemSQL, which are
memory-hungry applications, on the same nodes or
different nodes?
● If on the same cluster, how do we isolate the different
applications in their resource usage?
● Support for multi-tenancy?
8. Current Options
● Amazon EMR only supports big data tools
deployment on AWS
● Databricks only supports Spark based deployments
● Azure and Google Cloud have their own ways of setting
up deployments and scaling Spark
● On-prem, Cloudera and other Hadoop distributions
have their own ways of setting up clusters
● Also, none of the above options has an automated way
of deploying non big data tools
10. Microservice
● A way of developing and deploying an application as a
collection of multiple services which communicate with
each other via lightweight mechanisms, often an HTTP
resource API
● These services are built around business capabilities
and are independently deployable by fully automated
deployment machinery
● These services can be written in different languages
and can have different deployment strategies
11. Containerisation
● Containerisation is OS-level virtualization
● In the VM world, each VM has its own copy of the
operating system
● Containers share a common kernel on a given machine
● Very lightweight
● Supports resource isolation
● Most of the time, each microservice will be deployed as
an independent container
● This gives the ability to scale independently
12. Introduction to Docker
● Containers were available in some operating systems
like Solaris for over a decade
● Docker popularised containers on Linux
● Docker is a container runtime for running containers on
multiple operating systems
● Started in 2013 and now synonymous with containers
● Rocket (rkt) from CoreOS and LXD from Canonical are
the alternatives
13. Challenges with Containers
● Containers make individual services of an application
scale independently, but make discovering and
consuming these services challenging
● Monitoring these services across multiple hosts is
also challenging
● Clustering multiple containers for big data workloads
is a challenge with the default Docker tools
● So there needs to be a way to orchestrate these containers
when you run a lot of services on top of them
14. Container Orchestrators
● Container orchestrators are tools for orchestrating
containers at scale
● They provide mainly
○ Declarative configurations
○ Rules and Constraints
○ Provisioning on multiple hosts
○ Service Discovery
○ Health Monitoring
● Support multiple container runtimes
15. Different Container Orchestrators
● Docker Compose - not an orchestrator, but has basic
service discovery
● Docker Swarm by Docker Company
● Kubernetes by Google
● Apache Mesos with Docker integrations
16. Solution
● Deploy each part of the product as micro service
● Use a container orchestrator to scale each service
depending on its needs
● Discover services using orchestrator capabilities
● Use the orchestrator to deploy on different cloud and
on-prem
18. Kubernetes
● Open source system for
○ Automating deployment
○ Scaling
○ Management
of containerized applications.
● Production Grade Container Orchestrator
● Based on Borg and Omega, the internal container
orchestrators used by Google for 15 years
● https://kubernetes.io/
19. Why Kubernetes
● Production Grade Container Orchestration
● Support for Cloud and On-Prem deployments
● Agnostic to Container Runtime
● Support for easy clustering and load balancing
● Support for service upgrades and rollbacks
● Effective Resource Isolation and Management
● Well defined storage management
20. Minikube
● Minikube is a tool to run Kubernetes locally
● It runs a single node Kubernetes cluster using
virtualization layers like VirtualBox, Hyper-V etc
● In our example, we run minikube using VirtualBox
● Very useful for trying out Kubernetes for development
and testing purposes
● For installation steps, refer to
http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-2/
21. Kubectl
● Kubectl is a command line utility to interact with the
Kubernetes REST API
● It allows us to create, manage and delete different
resources in Kubernetes
● Kubectl can connect to any Kubernetes cluster
irrespective of where it's running
● We need to install kubectl along with minikube for
interacting with Kubernetes
22. Minikube Operations
● Starting minikube
minikube start
● Observe the running VM in VirtualBox
● See the Kubernetes dashboard
minikube dashboard
● Run kubectl
kubectl get po
24. Different Types of Abstraction
● Compute Abstractions (CPU)
Abstractions related to creating and managing compute
entities. Ex: Pod, Deployment
● Service/Network Abstractions (Network)
Abstractions related to exposing services on the network
● Storage Abstractions (Disk)
Disk related abstractions
26. Pod Abstraction
● A pod is a collection of one or more containers
● The smallest compute unit you can deploy on
Kubernetes
● Host abstraction for Kubernetes
● All containers of a pod run on a single node
● Provides the ability for containers to communicate with
each other using localhost
27. Defining Pod
● Kubernetes uses YAML/JSON for defining resources in
its framework
● YAML is a human readable serialization format mainly
used for configuration
● All our examples use YAML
● We are going to define a pod which creates a
container of nginx
● kube_examples/nginxpod.yaml
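The referenced file is not reproduced in the slides; a minimal sketch of what a pod definition like kube_examples/nginxpod.yaml could contain (the pod name and label are illustrative, not taken from the original file):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod        # illustrative name
  labels:
    app: nginx           # label later usable to tie a service to this pod
spec:
  containers:
  - name: nginx
    image: nginx         # official nginx image from Docker Hub
    ports:
    - containerPort: 80  # port nginx listens on inside the container
```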
28. Creating and Running Pod
● Once we define the pod, we need to create and run it
kubectl create -f kube_examples/nginxpod.yaml
● See the running pod
kubectl get po
● Observe the same on the dashboard
● Stop the pod
kubectl delete -f kube_examples/nginxpod.yaml
29. Drawbacks of Pod Abstraction
● The pod abstraction allows defining only a single copy of
a container at a time
● That's good enough for monolithic web applications
● But for Spark-like applications, which need
clustering, we need to define multiple copies of the same
container for clustering purposes
● Also, the pod abstraction doesn't support high availability
and upgrades
30. Deployment Abstraction
● Abstraction for the end to end life cycle of pods
● Ability to
○ Create
○ Upgrade
○ Destroy
pods
● Supports multiple replicas
● kube_examples/ngnixdeployment.yaml
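The deployment file itself is not shown in the slides; a minimal sketch of what such a deployment could look like (the name and apiVersion are assumptions — older Kubernetes releases used extensions/v1beta1 for deployments):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2                # multiple copies of the same container
  selector:
    matchLabels:
      app: nginx
  template:                  # pod template managed by the deployment
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80  # port opened on each container
```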
32. Container Port
● containerPort exposes a specific port on the container
● Uses the underlying container runtime, like Docker, to
implement this functionality
● Used to open up a port, e.g. for a web container to listen
on 80
● kube_examples/ngnixdeployment.yaml
33. Service
● The service abstraction defines a logical set of pods.
● It is a network abstraction which defines a policy to
expose the microservice backed by these pods to other
parts of the application.
● Separation of concerns between compute and service
● Ability to upgrade independent parts
● Label abstraction for connecting services and pods
● kube_examples/nginxservice.yaml
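The service file is referenced but not reproduced; a minimal sketch of what kube_examples/nginxservice.yaml could contain, assuming the pods carry an app: nginx label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx       # selects pods carrying this label
  ports:
  - port: 80         # port the service exposes inside the cluster
    targetPort: 80   # port on the pod the traffic is forwarded to
```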
34. Creating and Running Service
● Create Service
kubectl create -f kube_examples/nginxservice.yaml
● List Services
kubectl get svc
● Describe Service Details
kubectl describe svc nginx-service
35. Service EndPoint
● By default, all services defined in Kubernetes are
accessible only within the pods of the cluster
● This makes sure that only the services that are needed
publicly have to be exposed explicitly
● So we need to know the endpoint to actually call this
service
● It can be retrieved using the below command
kubectl describe svc nginx-service
36. Testing Service With BusyBox
● Once we have the endpoint, we can test it from a pod
inside our cluster
● We create a pod using the busybox image
● BusyBox is a minimal Linux distribution with shell utilities
● kubectl run -i --tty busybox --image=busybox
--restart=Never -- sh
● wget -O - <end-point>
38. Need for Custom Spark Image
● All Kubernetes deployments need a Docker image to
create a pod or deployment
● The default Spark image and configuration provided in
Kubernetes use an old version of Spark
● It also uses Google Cloud specific configuration which
we don't need in our application
● Having a custom image allows us to control Spark
upgrades in the future
39. Docker File
● Dockerfile is a file format defined by Docker to create
reproducible Docker images
● We create a single image used for both the Spark master
and worker containers
● We are using Spark version 2.1.0 with Java 8
● We will add external shell scripts for starting the master
and the worker
● docker/Dockerfile
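The actual docker/Dockerfile is not included in the slides; a hedged sketch of what it could look like, assuming the openjdk:8 base image and the Apache archive download URL (both are assumptions, not taken from the original repository):

```dockerfile
FROM openjdk:8

# Download and unpack Spark 2.1.0 built against Hadoop 2.6
RUN wget -qO- https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.6.tgz \
    | tar -xz -C /opt \
 && ln -s /opt/spark-2.1.0-bin-hadoop2.6 /opt/spark

# External scripts to start the master or the worker
COPY start-master.sh start-worker.sh /opt/spark/
RUN chmod +x /opt/spark/start-master.sh /opt/spark/start-worker.sh
```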
40. Building Docker Image
● We need to connect to the Docker daemon of minikube
to build the image inside the VM
eval $(minikube docker-env)
● Run docker ps
● Build the docker image
docker build -t spark-2.1.0-bin-hadoop2.6 .
● View docker images
docker images
42. Spark Master Deployment
● The Spark master deployment defines the configuration for
running the Spark master as a single pod
● We expose port 7077, as the master listens on that port
● Use the start-master script inside the Docker image to
start the Spark master
● We are using a standalone cluster for clustering
● spark-master.yaml
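A sketch of what spark-master.yaml could contain; the image tag matches the one built earlier, while the apiVersion, labels and script path are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-master
spec:
  replicas: 1                 # master runs as a single pod
  selector:
    matchLabels:
      app: spark-master
  template:
    metadata:
      labels:
        app: spark-master
    spec:
      containers:
      - name: spark-master
        image: spark-2.1.0-bin-hadoop2.6
        imagePullPolicy: Never          # use the image built inside minikube
        command: ["/opt/spark/start-master.sh"]
        ports:
        - containerPort: 7077           # port the master listens on
```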
43. Spark Master Service
● Once we define the spark-master, we need to expose it
using a service
● This service will be used by workers to connect to the
master pod
● We will expose
○ 8080 - For Web UI
○ 7077 - For Connecting to master
● We also name the service as spark-master
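A sketch of what the spark-master service could look like; because the service is named spark-master, workers can reach the master at spark://spark-master:7077 through cluster DNS (the selector label is an assumption):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: spark-master   # name workers use to resolve the master
spec:
  selector:
    app: spark-master  # assumed label on the master pod
  ports:
  - name: web-ui
    port: 8080         # Spark master web UI
  - name: master
    port: 7077         # workers connect here
```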
44. Spark Worker Deployment
● Once we have defined the spark-master, we need to define
the spark-worker deployment
● As it's a two node cluster, we will run a single worker
as of now
● We will expose
○ 7078 - For UI communication purposes
● Uses the start-worker.sh script to start the worker
● Doesn't need a service as workers are not exposed
45. Testing Single Node Cluster
● We can verify the UI using port-forward
kubectl port-forward <spark-master-name> 8080:8080
● Log in to the master
kubectl exec -it <spark-master-name> bash
● Run spark-shell and run spark code
/opt/spark/bin/spark-shell --master spark://spark-master:7077
sc.makeRDD(List(1,2,4,4)).count
49. Namespace Abstraction
● We can create multiple Spark clusters on a single
Kubernetes cluster using the namespace abstraction
● A namespace is a virtual cluster on the physical
Kubernetes cluster
● A namespace gives a separate namespace for pods,
services etc
● We can also apply resource restrictions on a
namespace for resource management
50. Multiple Cluster using Namespace
● Create namespace
kubectl create namespace cluster2
● Get all namespaces
kubectl get namespaces
● Set the namespace
export CONTEXT=$(kubectl config view | awk '/current-context/ {print $2}')
kubectl config set-context $CONTEXT --namespace=cluster2
52. Changing Version of Spark
● Now we have version 2.1.0 running
● We can change our deployment without changing our
configuration
● We have another image, spark-1.6.3-bin-hadoop2.6
● We can use the deployment abstraction's lifecycle
management to set a different image on the running pods
● This will bring the new pods up and then delete the old
pods
53. Deployment Set Image
● kubectl set image deployment/spark-master
spark-master=spark-1.6.3-bin-hadoop2.6
● kubectl set image deployment/spark-worker
spark-worker=spark-1.6.3-bin-hadoop2.6
● kubectl rollout status deployment/spark-master
● kubectl rollout status deployment/spark-worker
55. Controlling Resource Usage
● By default, a pod can use unlimited memory and CPU
● We can set minimum and maximum resource usage per
pod
● In our example, we are going to set limits on the Spark
worker so that it uses 1GB RAM and 1 core
● We can pass the same information to Spark also, so that it
reflects on the Spark UI
● spark-worker-resource.yaml
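The limits described above map to a resources section on the worker container; a sketch of the relevant fragment of spark-worker-resource.yaml (the resource values follow the slide, the surrounding fields are assumed):

```yaml
# Fragment of the worker pod template with resource limits
containers:
- name: spark-worker
  image: spark-2.1.0-bin-hadoop2.6
  command: ["/opt/spark/start-worker.sh"]
  resources:
    requests:          # minimum guaranteed resources
      memory: "1Gi"
      cpu: "1"
    limits:            # hard cap enforced by Kubernetes
      memory: "1Gi"
      cpu: "1"
```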
56. Summary
● Microservice based architecture to develop and deploy
Spark with other tools
● Use the container orchestrator Kubernetes to deploy and
manage the application lifecycle
● Use the deployment and service abstractions for
clustering and scaling
● Use the resource isolation of Docker and Kubernetes for
better server density