Lessons Learned From Dockerizing Spark Workloads
Thomas Phelan, Chief Architect, BlueData (@tapbluedata)
Nanda Vijaydev, Data Scientist, BlueData (@nandavijaydev)
February 8, 2017
Outline
• Docker Containers and Big Data
• Spark on Docker: Challenges
• How We Did It: Lessons Learned
• Key Takeaways
• Q & A
Distributed Spark Environments
• Data scientists want flexibility:
– New tools, latest versions of Spark, Kafka, H2O, et al.
– Multiple options – e.g. Zeppelin, RStudio, JupyterHub
– Fast, iterative prototyping
• IT wants control:
– Multi-tenancy
– Data security
– Network isolation
Why “Dockerize”?
Infrastructure
• Agility and elasticity
• Standardized environments (dev, test, prod)
• Portability (on-premises and public cloud)
• Efficiency (higher resource utilization)
Applications
• Fool-proof packaging (configs, libraries, driver versions, etc.)
• Repeatable builds and orchestration
• Faster app dev cycles
• Lightweight (virtually no performance or startup penalty)
The Journey to Spark on Docker
Start with a clear goal in sight
Begin with your Docker toolbox of a single container and basic networking and storage
So you want to run Spark on Docker in a multi-tenant enterprise deployment?
Warning: there are some pitfalls & challenges
Spark on Docker: Pitfalls
Navigate the river of container managers
• Swarm?
• Kubernetes?
• AWS ECS?
• Mesos?
Traverse the tightrope of network configurations
• Docker Networking? Calico
• Kubernetes Networking? Flannel, Weave Net
Cross the desert of storage configurations
• Overlay files?
• Flocker?
• Convoy?
Spark on Docker: Challenges
Pass through the jungle of software configurations
• R?
• Python?
• HDFS?
• NoSQL?
Tame the lion of performance
Finally you get to the top!
Trip down the staircase of deployment mistakes
• On-premises?
• Public cloud?
Spark on Docker: Next Steps?
But for an enterprise-ready deployment, you are still not even close to being done …
You still have to climb past: high availability, backup/recovery, security, multi-host, multi-container, upgrades and patches …
Spark on Docker: Quagmire
You realize it’s time to get some help!
How We Did It: Spark on Docker
• Docker containers provide a powerful option for greater agility and flexibility – whether on-premises or in the public cloud
• Running a complex, multi-service platform such as Spark in containers in a distributed enterprise-grade environment can be daunting
• Here is how we did it for the BlueData platform ... while maintaining performance comparable to bare-metal
How We Did It: Design Decisions
• Run Spark (any version) with related models, tools / applications, and notebooks unmodified
• Deploy all relevant services that typically run on a single Spark host in a single Docker container
• Multi-tenancy support is key
– Network and storage security
– User management and access control
How We Did It: Design Decisions
• Docker images built to “auto-configure” themselves at time of instantiation
– Not all instances of a single image run the same set of services when instantiated
• Master vs. worker cluster nodes
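The auto-configuration idea above can be sketched as follows. This is a minimal illustration, not the BlueData mechanism: the NODE_ROLE variable, the role names, and the role-to-service mapping are all hypothetical. It only shows how one image can start a different set of services depending on the role it is given at instantiation.

```python
import os

# Hypothetical role-to-service mapping; the service names are illustrative
# and not taken from the BlueData platform.
SERVICES_BY_ROLE = {
    "master": ["spark-master", "zeppelin"],
    "worker": ["spark-worker"],
}


def services_for_role(role):
    """Return the services a container should start for its assigned role."""
    try:
        return list(SERVICES_BY_ROLE[role])
    except KeyError:
        raise ValueError(f"unknown node role: {role!r}")


def services_from_environment(environ=None):
    """Read the role from a hypothetical NODE_ROLE variable set at instantiation."""
    env = os.environ if environ is None else environ
    # Assumption: a container with no role assigned behaves as a worker.
    return services_for_role(env.get("NODE_ROLE", "worker"))
```

An entrypoint script could call services_from_environment() once at startup and launch only the returned services, so the same image serves both master and worker nodes.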
How We Did It: Design Decisions
• Maintain the promise of Docker containers
– Keep them as stateless as possible
– Container storage is always ephemeral
– Persistent storage is external to the container
How We Did It: CPU & Network
• Resource Utilization
– CPU cores vs. CPU shares
– Over-provisioning of CPU recommended
– No over-provisioning of memory
• Network
– Connect containers across hosts
– Persistence of IP address across container restart
– Deploy VLANs and VxLAN tunnels for tenant-level traffic isolation
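To make the "CPU cores vs. CPU shares" point concrete: Docker's --cpu-shares value is a relative weight (default 1024) that matters only when containers compete for CPU. That is one reason CPU can be over-provisioned more safely than memory. The sketch below computes the fraction of CPU time each container would receive when all of them are busy. The container names and share values are made up, and the slides do not say which Docker settings were actually used.

```python
def cpu_fraction_under_contention(shares):
    """Map container name -> fraction of CPU time when every container is busy.

    `shares` maps a container name to its relative CPU weight, in the same
    spirit as the value passed to `docker run --cpu-shares`.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total shares must be positive")
    return {name: value / total for name, value in shares.items()}


# Illustrative values only: one master weighted twice as high as each worker.
example = cpu_fraction_under_contention(
    {"master": 1024, "worker-1": 512, "worker-2": 512}
)
```

When the host is idle, a container with a low weight can still use spare CPU; the weights apply only under contention.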
How We Did It: Network
[Diagram: two hosts, each with an OVS bridge and a NIC, joined by a VxLAN tunnel; a container orchestrator with DHCP/DNS; tenant networks containing Resource Manager, Node Manager, SparkMaster, SparkWorker, and Zeppelin containers]
How We Did It: Storage
• Storage
– Tweak the default size of a container’s /root
• Resizing of storage inside an existing container is tricky
– Mount logical volume on /data
• No use of overlay file systems
– BlueData DataTap (version-independent, HDFS-compliant)
• Connectivity to external storage
• Image Management
– Utilize Docker’s image repository
TIP: Mounting block devices into a container does not support symbolic links (in other words: /dev/sdb will not work, /dm/… PCI device can change across host reboot).
TIP: Docker images can get large. Use “docker squash” to save on size.
How We Did It: Security
• Security is essential since containers and host share one kernel
– Non-privileged containers
• Achieved through layered set of capabilities
• Different capabilities provide different levels of isolation and protection
How We Did It: Security
• Add “capabilities” to a container based on what operations are permitted
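One common way to apply this with the Docker CLI is to drop every Linux capability and add back only the ones a container needs, using the --cap-drop and --cap-add flags of docker run. The sketch below builds such a command line. The image name and the chosen capabilities are placeholders, not the set used on the BlueData platform.

```python
def docker_run_args(image, allowed_caps):
    """Build a `docker run` argument list for a non-privileged container.

    All capabilities are dropped first; only those in `allowed_caps`
    are added back.
    """
    args = ["docker", "run", "--cap-drop=ALL"]
    # Sort and de-duplicate so the command line is stable across runs.
    for cap in sorted(set(allowed_caps)):
        args.append(f"--cap-add={cap}")
    args.append(image)
    return args


# Placeholder image name and capability set, for illustration only.
command = docker_run_args("spark-image", ["SYS_CHROOT", "NET_BIND_SERVICE"])
```

The resulting list can be passed to subprocess.run without a shell, which avoids quoting problems.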
How We Did It: Sample Dockerfile
# Spark Docker image for RHEL/CentOS 6.x
FROM centos:centos6
# Download and extract Spark
RUN mkdir -p /usr/lib/spark && curl -s http://archive.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-bin-hadoop2.6.tgz | tar xz -C /usr/lib/spark/
# Install Zeppelin
RUN mkdir -p /usr/lib/zeppelin && curl -s https://s3.amazonaws.com/bluedata-catalog/thirdparty/zeppelin/zeppelin-0.7.0-SNAPSHOT.tar.gz | tar xz -C /usr/lib/zeppelin
# Copy the service configuration script into the image, make it executable, and run it
ADD configure_spark_services.sh /root/configure_spark_services.sh
RUN chmod +x /root/configure_spark_services.sh && /root/configure_spark_services.sh
BlueData App Image (.bin file)
[Diagram: the application bin file contains a Docker image (built from a CentOS Dockerfile or a RHEL Dockerfile), appconfig (conf, init.d, startscript), an <app> logo file, and <app>.wb. It is built from sources (Dockerfile, logo .PNG, init.d) and runtime software bits using bdwb commands (clusterconfig, image, role, appconfig, catalog, service, ..), or during development by extracting an existing .bin and modifying it to create a new bin]
Multi-Host
4 containers on 2 different hosts using 1 VLAN and 4 persistent IPs
Services per Container
Master Services
Worker Services
Container Storage from Host
Container Storage
Host Storage
Performance Testing: Spark
• Spark 1.x on YARN
• HiBench - Terasort
• Data sizes: 100 GB, 500 GB, 1 TB
• 10-node physical/virtual cluster
• 36 cores and 112 GB memory per node
• 2 TB HDFS storage per node (SSDs)
• 800 GB ephemeral storage
Spark on Docker: Performance
[Chart: throughput in MB/s]
“Dockerized” Spark on BlueData
Spark with Zeppelin Notebook
5 fully managed Docker containers with persistent IP addresses
Pre-Configured Docker Images
Choice of data science tools and notebooks (e.g. JupyterHub, RStudio, Zeppelin) and ability to “bring-your-own” tools and versions
On-Demand Elastic Infrastructure
Create “pristine” Docker containers with appropriate resources
Log in to container and install your tools & libraries
Spark on Docker: Key Takeaways
• All apps can be “Dockerized”, including Spark
– Docker containers enable a more flexible and agile deployment model
– Faster app dev cycles for Spark app developers, data scientists, & engineers
– Enables DevOps for data science teams
Spark on Docker: Key Takeaways
• Deployment requirements:
– Docker base images include all needed Spark libraries and jar files
– Container orchestration, including networking and storage
– Resource-aware runtime environment, including CPU and RAM
Spark on Docker: Key Takeaways
• Data scientist considerations:
– Access to data with full fidelity
– Access to data processing and modeling tools
– Ability to run, rerun, and scale analysis
– Ability to compare and contrast various techniques
– Ability to deploy & integrate enterprise-ready solution
Spark on Docker: Key Takeaways
• Enterprise deployment challenges:
– Access to container secured with ssh keypair or PAM module (LDAP/AD)
– Fast access to external storage
– Management agents in Docker images
– Runtime injection of resource and configuration information
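Runtime injection of resource and configuration information can be pictured as rendering a configuration file from values supplied when the container starts. The sketch below assumes a standalone Spark master and writes three standard Spark properties (spark.master, spark.executor.cores, spark.executor.memory). The function name, its arguments, and the choice of properties are illustrative and are not taken from the BlueData platform.

```python
def render_spark_defaults(cores, memory_mb, master_host, master_port=7077):
    """Render spark-defaults.conf text from values injected at container start.

    Assumes a standalone Spark master; 7077 is Spark's default standalone port.
    """
    if cores < 1 or memory_mb < 1:
        raise ValueError("cores and memory_mb must be positive")
    lines = [
        f"spark.master spark://{master_host}:{master_port}",
        f"spark.executor.cores {cores}",
        f"spark.executor.memory {memory_mb}m",
    ]
    return "\n".join(lines) + "\n"


# Illustrative values: a container granted 4 cores and 8 GB of RAM.
conf_text = render_spark_defaults(cores=4, memory_mb=8192, master_host="10.0.0.5")
```

Because the values arrive at start time rather than build time, the same image can run with different resource allocations.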
Spark on Docker: Key Takeaways
• “Do It Yourself” will be costly & time-consuming
– Be prepared to tackle the infrastructure challenges and pitfalls to Dockerize Spark
– Your business value will come from data science and applications – not the plumbing
• There are other options:
– BlueData = a turnkey solution, for on-premises or on AWS
Thank You
Contact Info:
@tapbluedata
@nandavijaydev
tap@bluedata.com
nanda@bluedata.com
www.bluedata.com
Visit BlueData’s booth in the expo hall

Lessons Learned from Dockerizing Spark Workloads

  • 1. Lessons Learned From Dockerizing Spark Workloads Thomas Phelan Nanda Vijaydev Chief Architect, BlueData Data Scientist, BlueData @tapbluedata @nandavijaydev February 8, 2017
  • 2. Outline • Docker Containers and Big Data • Spark on Docker: Challenges • How We Did It: Lessons Learned • Key Takeaways • Q & A
  • 3. Distributed Spark Environments • Data scientists want flexibility: – New tools, latest versions of Spark, Kafka, H2O, et al. – Multiple options – e.g. Zeppelin, RStudio, JupyterHub – Fast, iterative prototyping • IT wants control: – Multi-tenancy – Data security – Network isolation
  • 4. Why “Dockerize”? Infrastructure • Agility and elasticity • Standardized environments (dev, test, prod) • Portability (on-premises and public cloud) • Efficient (higher resource utilization) Applications • Fool-proof packaging (configs, libraries, driver versions, etc.) • Repeatable builds and orchestration • Faster app dev cycles • Lightweight (virtually no performance or startup penalty)
  • 5. The Journey to Spark on Docker Start with a clear goal in sight Begin with your Docker toolbox of a single container and basic networking and storage So you want to run Spark on Docker in a multi-tenant enterprise deployment? Warning: there are some pitfalls & challenges
  • 6. Spark on Docker: Pitfalls Traverse the tightrope of network configurations Navigate the river of container managers • Swarm ? • Kubernetes ? • AWS ECS ? • Mesos ? • Overlay files ? • Flocker ? • Convoy ? • Docker Networking ? Calico • Kubernetes Networking ? Flannel, Weave Net Cross the desert of storage configurations
  • 7. Spark on Docker: Challenges Pass thru the jungle of software configurations Tame the lion of performance Finally you get to the top! Trip down the staircase of deployment mistakes • R ? • Python ? • HDFS ? • NoSQL ? • On-premises ? • Public cloud ?
  • 8. Spark on Docker: Next Steps? But for an enterprise-ready deployment, you are still not even close to being done … You still have to climb past: high availability, backup/recovery, security, multi-host, multi-container, upgrades and patches …
  • 9. Spark on Docker: Quagmire You realize it’s time to get some help!
  • 10. How We Did It: Spark on Docker • Docker containers provide a powerful option for greater agility and flexibility – whether on-premises or in the public cloud • Running a complex, multi-service platform such as Spark in containers in a distributed enterprise-grade environment can be daunting • Here is how we did it for the BlueData platform ... while maintaining performance comparable to bare-metal
  • 11. How We Did It: Design Decisions • Run Spark (any version) with related models, tools / applications, and notebooks unmodified • Deploy all relevant services that typically run on a single Spark host in a single Docker container • Multi-tenancy support is key – Network and storage security – User management and access control
  • 12. How We Did It: Design Decisions • Docker images built to “auto-configure” themselves at time of instantiation – Not all instances of a single image run the same set of services when instantiated • Master vs. worker cluster nodes
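The "auto-configure at instantiation" idea on slide 12 can be sketched roughly as below. This is a minimal illustration, not BlueData's implementation: the `NODE_ROLE` environment variable and the service names in the mapping are assumptions made for the example, since the slides do not list them.

```python
import os
from typing import Dict, List

# Hypothetical role-to-service mapping; the slides only say that master and
# worker nodes run different services from the same image.
SERVICES_BY_ROLE: Dict[str, List[str]] = {
    "master": ["spark-master", "zeppelin"],
    "worker": ["spark-worker"],
}


def services_for_role(role: str) -> List[str]:
    """Return the services a container should start for the given role."""
    if role not in SERVICES_BY_ROLE:
        raise ValueError(f"unknown node role: {role!r}")
    # Return a copy so callers cannot mutate the shared mapping.
    return list(SERVICES_BY_ROLE[role])


if __name__ == "__main__":
    # NODE_ROLE is an assumed variable injected when the container starts.
    role = os.environ.get("NODE_ROLE", "worker")
    for service in services_for_role(role):
        print(f"would start {service}")
```

One image can then serve every node in the cluster; only the injected role differs between containers.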
  • 13. How We Did It: Design Decisions • Maintain the promise of Docker containers – Keep them as stateless as possible – Container storage is always ephemeral – Persistent storage is external to the container
  • 14. How We Did It: CPU & Network • Resource Utilization – CPU cores vs. CPU shares – Over-provisioning of CPU recommended – No over-provisioning of memory • Network – Connect containers across hosts – Persistence of IP address across container restart – Deploy VLANs and VxLAN tunnels for tenant-level traffic isolation
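The resource guidance on slide 14 can be illustrated with a small helper that builds `docker run` resource flags. `--cpu-shares` and `--memory` are standard Docker options; the helper names, the placement check, and the numbers in the usage example are assumptions for illustration only.

```python
from typing import List


def docker_resource_flags(cpu_shares: int, memory_mb: int) -> List[str]:
    """Build `docker run` flags: a relative CPU weight and a hard memory limit."""
    if cpu_shares <= 0 or memory_mb <= 0:
        raise ValueError("cpu_shares and memory_mb must be positive")
    return ["--cpu-shares", str(cpu_shares), "--memory", f"{memory_mb}m"]


def memory_fits(host_memory_mb: int, container_memory_mb: List[int]) -> bool:
    """True when the sum of container memory limits does not exceed host memory."""
    return sum(container_memory_mb) <= host_memory_mb


if __name__ == "__main__":
    # Two containers with equal CPU weight on a hypothetical 16 GB host.
    print(docker_resource_flags(cpu_shares=1024, memory_mb=6144))
    print(memory_fits(16384, [6144, 6144]))
```

CPU shares are relative weights that Docker enforces only when CPU is contended, so over-committing them is usually tolerable. Memory limits are hard caps, which is one reason to reject placements that exceed host memory.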
  • 15. How We Did It: Network OVS Container Orchestrator DHCP/DNS VxLAN tunnel NIC Tenant Networks OVS NIC Resource Manager Node Manager Node Manager SparkMaster SparkWorker Zeppelin SparkWorker
  • 16. How We Did It: Storage • Storage – Tweak the default size of a container’s /root • Resizing of storage inside an existing container is tricky – Mount logical volume on /data • No use of overlay file systems – BlueData DataTap (version-independent, HDFS-compliant) • Connectivity to external storage • Image Management – Utilize Docker’s image repository TIP: Mounting block devices into a container does not support symbolic links (IOW: /dev/sdb will not work, /dm/… PCI device can change across host reboot). TIP: Docker images can get large. Use “docker squash” to save on size.
  • 17. How We Did It: Security • Security is essential since containers and host share one kernel – Non-privileged containers • Achieved through layered set of capabilities • Different capabilities provide different levels of isolation and protection
  • 18. How We Did It: Security • Add “capabilities” to a container based on what operations are permitted
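One way to read slides 17 and 18 is: start a non-privileged container with no capabilities, then add back only those that the permitted operations need. The sketch below builds the corresponding `docker run` flags. `--cap-drop` and `--cap-add` are standard Docker options and the capability names are real Linux capabilities, but the operation names and the mapping are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from permitted operations to Linux capabilities.
CAPABILITY_FOR_OPERATION: Dict[str, str] = {
    "bind_low_ports": "NET_BIND_SERVICE",
    "change_file_owner": "CHOWN",
    "configure_network": "NET_ADMIN",
}


def capability_flags(operations: Iterable[str]) -> List[str]:
    """Drop all capabilities, then add back those the operations require."""
    needed = sorted({CAPABILITY_FOR_OPERATION[op] for op in operations})
    flags = ["--cap-drop", "ALL"]
    for capability in needed:
        flags += ["--cap-add", capability]
    return flags


if __name__ == "__main__":
    print(capability_flags(["bind_low_ports", "change_file_owner"]))
```

An unknown operation raises `KeyError`, so a container cannot silently receive a capability that was never mapped.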
  • 19. How We Did It: Sample Dockerfile
      # Spark-1. docker image for RHEL/CentOS 6.x
      FROM centos:centos6
      # Download and extract spark
      RUN mkdir /usr/lib/spark; curl -s http://archive.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-bin-hadoop2.6.tgz | tar xz -C /usr/lib/spark/
      ## Install zeppelin
      RUN mkdir /usr/lib/zeppelin; curl -s https://s3.amazonaws.com/bluedata-catalog/thirdparty/zeppelin/zeppelin-0.7.0-SNAPSHOT.tar.gz | tar xz -C /usr/lib/zeppelin
      # Copy in the configuration script, make it executable, and run it
      ADD configure_spark_services.sh /root/configure_spark_services.sh
      RUN chmod +x /root/configure_spark_services.sh && /root/configure_spark_services.sh
  • 20. BlueData App Image (.bin file) Application bin file Docker image CentOS Dockerfile RHEL Dockerfile appconfig conf Init.d startscript <app> logo file <app>.wb bdwb command clusterconfig, image, role, appconfig, catalog, service, .. Sources Docker file, logo .PNG, Init.d RuntimeSoftware Bits OR Development (e.g. extract .bin and modify to create new bin)
  • 21. Multi-Host 4 containers on 2 different hosts using 1 VLAN and 4 persistent IPs
  • 22. Services per Container Master Services Worker Services
  • 23. Container Storage from Host Container Storage Host Storage
  • 24. Performance Testing: Spark • Spark 1.x on YARN • HiBench - Terasort • Data sizes: 100GB, 500GB, 1TB • 10 node physical/virtual cluster • 36 cores and 112GB memory per node • 2TB HDFS storage per node (SSDs) • 800GB ephemeral storage
  • 25. Spark on Docker: Performance MB/s
  • 26. “Dockerized” Spark on BlueData Spark with Zeppelin Notebook 5 fully managed Docker containers with persistent IP addresses
  • 27. Pre-Configured Docker Images Choice of data science tools and notebooks (e.g. JupyterHub, RStudio, Zeppelin) and ability to “bring-your-own” tools and versions
  • 28. On-Demand Elastic Infrastructure Create “pristine” Docker containers with appropriate resources Log in to container and install your tools & libraries
  • 29. Spark on Docker: Key Takeaways • All apps can be “Dockerized”, including Spark – Docker containers enable a more flexible and agile deployment model – Faster app dev cycles for Spark app developers, data scientists, & engineers – Enables DevOps for data science teams
  • 30. Spark on Docker: Key Takeaways • Deployment requirements: – Docker base images include all needed Spark libraries and jar files – Container orchestration, including networking and storage – Resource-aware runtime environment, including CPU and RAM
  • 31. Spark on Docker: Key Takeaways • Data scientist considerations: – Access to data with full fidelity – Access to data processing and modeling tools – Ability to run, rerun, and scale analysis – Ability to compare and contrast various techniques – Ability to deploy & integrate enterprise-ready solution
  • 32. Spark on Docker: Key Takeaways • Enterprise deployment challenges: – Access to container secured with ssh keypair or PAM module (LDAP/AD) – Fast access to external storage – Management agents in Docker images – Runtime injection of resource and configuration information
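"Runtime injection of resource and configuration information" on slide 32 could look something like the sketch below, which turns values injected into the container's environment into `spark-defaults.conf` lines. The `INJECTED_*` variable names are hypothetical; `spark.executor.memory` and `spark.executor.cores` are standard Spark properties. The slides do not describe BlueData's actual mechanism.

```python
import os
from typing import Dict, Mapping

# Hypothetical injected variable names mapped to real Spark properties.
PROPERTY_FOR_VARIABLE: Dict[str, str] = {
    "INJECTED_EXECUTOR_MEMORY": "spark.executor.memory",
    "INJECTED_EXECUTOR_CORES": "spark.executor.cores",
}


def settings_from_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Pick out the injected variables that are present and map them to Spark properties."""
    return {
        prop: env[var]
        for var, prop in PROPERTY_FOR_VARIABLE.items()
        if var in env
    }


def render_spark_defaults(settings: Mapping[str, str]) -> str:
    """Render settings as whitespace-separated `key value` lines, sorted by key."""
    return "".join(f"{key} {value}\n" for key, value in sorted(settings.items()))


if __name__ == "__main__":
    print(render_spark_defaults(settings_from_env(os.environ)), end="")
```

Rendering the file at container start keeps the image itself generic, which fits the earlier design decision to keep containers as stateless as possible.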
  • 33. Spark on Docker: Key Takeaways • “Do It Yourself” will be costly & time-consuming – Be prepared to tackle the infrastructure challenges and pitfalls to Dockerize Spark – Your business value will come from data science and applications – not the plumbing • There are other options: – BlueData = a turnkey solution, for on-premises or on AWS