An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark

IBM Spark Technology Center
Big Data Vilnius – Nov 2017
Luciano Resende
IBM | Spark Technology Center

About me - Luciano Resende
Data Science Platform Architect – IBM – Spark Technology Center
• Have been contributing to open source at the ASF for over 10 years
• Currently contributing to the Jupyter Notebook ecosystem, Apache Bahir, Apache Spark, and Apache Toree, among other projects related to the Apache Spark ecosystem
lresende@apache.org
http://lresende.blogspot.com/
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende

Open Source Community Leadership
Spark Technology Center
• Founding Partner
• 188+ Project Committers
• 77+ Projects
• Key open source steering committee memberships
• OSS Advisory Board

IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com
Key statistics:
About 40 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark: http://jiras.spark.tc
Apache SystemML is now a top-level Apache project!
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center

Agenda
IBM Data Science Experience
IBM Analytics Engine
Challenges faced building an Analytics Platform
Jupyter Enterprise Gateway
Jupyter Enterprise Gateway Deployment
References

IBM Data Science Experience (DSX)
Be a better data scientist
IBM Data Science Experience is an environment that brings together everything that a Data Scientist needs to be more productive, including tools, data and content.

DSX is built on a foundation of open source, primarily Jupyter notebooks.
Notebooks are interactive computational environments in which you can combine code execution, rich text, mathematics, plots and rich media.

Jupyter Notebook Platform Architecture
• Notebook UI runs on the browser
• The Notebook Server serves the notebooks
• Kernels interpret and execute cell contents
• Kernels are responsible for code execution
• Kernels abstract the different languages
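To make the split between the Notebook UI, the server, and the kernels more concrete, here is a minimal sketch of the kind of message a front end sends when a cell is executed. It follows the general shape of an `execute_request` in the Jupyter messaging protocol; the helper name `make_execute_request` and the user name `alice` are illustrative assumptions, and message signing and transport are left out.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="alice"):
    """Build an 'execute_request' message as a plain dict (hypothetical helper)."""
    header = {
        "msg_id": uuid.uuid4().hex,          # unique per message
        "session": session_id,               # identifies the client session
        "username": username,                # assumed user name, for illustration
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",                    # messaging protocol version
    }
    content = {
        "code": code,                        # the cell source the kernel will run
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {"header": header, "parent_header": {}, "metadata": {}, "content": content}


message = make_execute_request("print(1 + 1)", session_id=uuid.uuid4().hex)
print(json.dumps(message, indent=2))
```

The message carries only source text and metadata, which is why the same front end can drive Python, Scala, or R kernels: the language-specific work happens entirely inside the kernel.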

Follow-ups
TRY IT:
datascience.ibm.com
Free IBM Data Science trial:
https://ibm.biz/Bdj9xP

IBM Analytics Engine - Characteristics
IBM Analytics Engine is built on open source Apache Hadoop and Apache Spark. It gives users the flexibility of open source and an opportunity to expand on their existing open source investments.
IBM Analytics Engine helps data scientists, data engineers, and developers focus on building data models and business solutions while simplifying cluster administration through easy-to-use interfaces for management and integration.
IBM Analytics Engine deploys clusters in minutes with enterprise-level security, reliability, and powerful integration capabilities for data management, monitoring, and dashboards.

Capabilities
Separation of compute and storage
• Scale compute and storage independently for better economics
• Separate compute and storage ensure no data loss in case of cluster failure
• Ease of incorporating patches or upgrades by creating new clusters
• Spin up use-case-specific clusters using different instance sizes for different use cases
• Uniform governance and collaboration through WDP services
Ease of use and administration
• Access and administer through multiple interfaces: Cloud Foundry CLI, REST APIs on a public interface, and GUI
• Enhanced flexibility for configuring clusters, including installing 3rd party libraries through bootstrap scripts
• Deploy and scale clusters within minutes, in a few clicks, including propagating libraries and configurations to all nodes of the cluster

Capabilities
Enhanced reliability and security
• ‘Auto-heal’ capability recovers processes from failure *
• Geo-replicated object store for disaster avoidance
• Encrypted object store, data-at-rest, and data-in-motion encryption * provide enhanced levels of security
Flexibility and innovation of open source
• Built on an ODPi-compliant Apache Spark and Apache Hadoop stack for portability between open source environments
• Integrate analytics tools using standard, open source libraries and drivers
* Roadmap item

Enterprise/Cloud Analytics Platform Characteristics
Large pool of shared computing resources
• Enterprise Cloud, Public Cloud or Hybrid
• Data in the cloud (Data Lakes/Object Storage)
Distributed Consumers
• Notebooks running locally (on the user’s laptop) or as a service
Different Resource Utilization Patterns
• High number of idle resources

Analytics Platform – Current state of the art
Open Source Jupyter based Notebook Platform
• Single user: everyone shares the same distributed filesystem and privileges
• Jupyter kernels run as local processes
• Resources are limited to what is available on the single node that runs all kernels and their associated Spark drivers
• No security: users can see and control each other’s processes using Jupyter’s administration utilities

Analytics Platform Today – Shared Cluster
Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels inside the cluster, sharing its resources.
• All Jupyter kernels run under a shared “service” user ID.
• Users can see and control each other’s kernels using Jupyter’s administration utilities.
• All kernels and their associated Spark drivers run on a single (configurable) node of the cluster.
[Diagram: Bob’s desktop runs multiple notebooks on a Jupyter Notebook Server (with NB2KG), which connects through a security layer to a Jupyter Kernel Gateway in the Spark cluster (sandboxed by service user privileges). Each kernel (Spark driver) runs in yarn-client mode as the JNBG service user, with Spark executors on the YARN workers also running as the JNBG service user.]

Analytics Platform Today – Single User Cluster
Allows Jupyter notebooks running outside of the cluster to run Jupyter kernels in a cluster created specifically for the user.
• Expensive, as a cluster is created for every individual user
[Diagram: Bob’s desktop runs multiple notebooks on a Jupyter Notebook Server (with NB2KG), connected to a Jupyter Kernel Gateway in a per-user Spark cluster (sandboxed by service user privileges). The kernel (Spark driver) and its executors run on the YARN workers.]


Jupyter Enterprise Gateway
A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across an Apache Spark cluster, targeting Enterprise/Cloud requirements and use cases

Jupyter Enterprise Gateway – Goals
Optimized Resource Allocation
• Run Spark in YARN cluster mode to better utilize cluster resources
• Pluggable architecture for additional resource managers
Enhanced Security
• Secure socket communications
• Any HTTP communication should be encrypted (SSL)
Multiuser support with user impersonation
• Enhance security and sandboxing by enabling user impersonation when running kernels (using Kerberos)
• Individual HDFS home folder for each notebook user
• Use the same user ID for notebook and batch jobs
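As a rough illustration of the first and third goals, the sketch below assembles a `spark-submit` command line that asks YARN to run the driver inside the cluster (`--deploy-mode cluster`) on behalf of the notebook user (`--proxy-user`). These are standard `spark-submit` options, but the helper `build_spark_submit` and the script name `launch_kernel.py` are hypothetical, and this is not claimed to be how Enterprise Gateway itself launches kernels.

```python
import shlex


def build_spark_submit(app_path, proxy_user, deploy_mode="cluster"):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    if deploy_mode not in ("cluster", "client"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode}")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,   # "cluster" places the driver in a YARN container
        "--proxy-user", proxy_user,     # run the application as the notebook user
        app_path,                       # hypothetical kernel launch script
    ]


command = build_spark_submit("launch_kernel.py", proxy_user="alice")
print(shlex.join(command))
# spark-submit --master yarn --deploy-mode cluster --proxy-user alice launch_kernel.py
```

In client mode the driver stays on the node that issued the command, which is the single-node bottleneck described earlier in the deck. In cluster mode YARN picks the node, so drivers spread across the cluster.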

Supported Platforms
• Python / Spark 2.x using the IPython kernel (with delayed Spark Context initialization)
• Scala 2.11 / Spark 2.x using the Apache Toree kernel (with delayed Spark Context initialization)
• R / Spark 2.x using IRkernel
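The slides do not say how delayed Spark Context initialization is implemented. One common way to get the effect is to wrap the expensive constructor so it runs on first use rather than at kernel start. The sketch below shows that pattern with a stand-in factory instead of a real `SparkContext`; `LazyContext` and `fake_spark_context` are hypothetical names.

```python
import threading


class LazyContext:
    """Create an expensive context on first use instead of at kernel start."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._context = None

    def get(self):
        # Double-checked locking: concurrent callers create the context only once.
        if self._context is None:
            with self._lock:
                if self._context is None:
                    self._context = self._factory()
        return self._context


calls = []


def fake_spark_context():
    # Stand-in for a real SparkContext constructor (assumption for illustration).
    calls.append("created")
    return {"app_name": "notebook-kernel"}


sc = LazyContext(fake_spark_context)
created_at_start = list(calls)   # nothing has been created yet
first = sc.get()                 # the first access triggers creation
second = sc.get()                # later accesses reuse the same object
```

With this arrangement the kernel can report itself ready immediately, and cluster resources are requested only when a cell actually touches Spark.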

Kernel scalability comparison: Cluster mode vs Client mode

Jupyter Enterprise Gateway Functionality
• Enable running kernels remotely in a cluster
• Pluggable kernel lifecycle management
• Enhanced security
• Multiuser leveraging user impersonation
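One way to read "pluggable kernel lifecycle management" is an interface that hides where and how a kernel process is started, so that a local launcher and a resource-manager-backed launcher are interchangeable. The sketch below is an assumption-laden illustration of that idea, not Enterprise Gateway's actual classes: `KernelLifecycleManager`, `LocalProcessManager`, and the `MANAGERS` registry are hypothetical names, and only a local implementation is shown.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class KernelLifecycleManager(ABC):
    """Hypothetical interface: start, poll, and stop one kernel process."""

    @abstractmethod
    def launch(self, argv): ...

    @abstractmethod
    def is_alive(self): ...

    @abstractmethod
    def terminate(self): ...


class LocalProcessManager(KernelLifecycleManager):
    """Runs the kernel as a child process on the local node."""

    def __init__(self):
        self._process = None

    def launch(self, argv):
        self._process = subprocess.Popen(argv)

    def is_alive(self):
        return self._process is not None and self._process.poll() is None

    def terminate(self):
        if self._process is not None:
            self._process.terminate()
            self._process.wait()


# A YARN-backed manager would register here under its own key (not shown).
MANAGERS = {"local": LocalProcessManager}


def create_manager(kind):
    return MANAGERS[kind]()


manager = create_manager("local")
# A sleeping Python process stands in for a real kernel.
manager.launch([sys.executable, "-c", "import time; time.sleep(30)"])
alive_after_launch = manager.is_alive()
manager.terminate()
alive_after_terminate = manager.is_alive()
```

The gateway would only call `launch`, `is_alive`, and `terminate`, so supporting another resource manager means adding another implementation rather than changing the gateway itself.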
[Diagram: A Jupyter Notebook Server connects through a security layer to Jupyter Enterprise Gateway, which provides multitenancy, remote kernels, and kernel lifecycle management. In the Spark cluster, each Jupyter kernel (Spark driver) runs in its own YARN container on the YARN workers, with its own Spark executors. Impersonation: Alice’s kernel runs under Alice’s user ID.]

Jupyter Enterprise Gateway – Roadmap
• Kernel Configuration Profile
• Enable clients to request different resource configurations for kernels (e.g. small, medium, large)
• Profiles should be defined by administrators and enabled per user or group of users
• Administration UI
• Dashboard with running kernels and administration actions
• Time running, stop/kill, profile management, etc.
• Add support for other resource managers
• User Environments
• High Availability
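The kernel configuration profiles in this roadmap could be represented roughly as follows. This is a sketch under stated assumptions: the profile sizes, group names, and the `resolve_profile` helper are invented for illustration, and only the `spark.executor.*` keys are real Spark configuration properties.

```python
# Hypothetical administrator-defined profiles mapped to Spark settings.
PROFILES = {
    "small": {"spark.executor.instances": "2", "spark.executor.memory": "2g"},
    "medium": {"spark.executor.instances": "4", "spark.executor.memory": "4g"},
    "large": {"spark.executor.instances": "8", "spark.executor.memory": "8g"},
}

# Hypothetical mapping of which groups may use which profile.
PROFILE_GROUPS = {
    "small": {"analysts", "data-science"},
    "medium": {"data-science"},
    "large": {"platform-admins"},
}


def resolve_profile(name, user_groups):
    """Return spark-submit --conf arguments for a profile the user may use."""
    if name not in PROFILES:
        raise KeyError(f"unknown profile: {name}")
    if not PROFILE_GROUPS[name] & set(user_groups):
        raise PermissionError(f"profile {name!r} is not enabled for this user")
    args = []
    for key, value in sorted(PROFILES[name].items()):
        args += ["--conf", f"{key}={value}"]
    return args


medium_args = resolve_profile("medium", ["data-science"])
print(medium_args)
```

Keeping the permitted groups next to the resource settings means the same lookup answers both questions the roadmap raises: what a profile provides and who is allowed to request it.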

Building your own test environment with

Jupyter Enterprise Gateway - Deployment
[Diagram: A management node, powered by Ambari, hosts Enterprise Gateway (EG) in front of a compute engine based on Apache Spark.]

Ansible deployment scripts
• https://github.com/lresende/spark-cluster-install
One-click deployment of the Spark cluster
• Configure your host inventory (see the example in the Git repository)
• Run the "setup-ambari.yml" playbook
• $ ansible-playbook --verbose setup-ambari.yml -i hosts-fyre-ambari -c paramiko
One-click deployment of Jupyter Enterprise Gateway
• Run the "setup-enterprise-gateway.yml" playbook
• $ ansible-playbook --verbose setup-enterprise-gateway.yml -i hosts-fyre-ambari -c paramiko

Docker images
• yarn-spark: Basic one node Spark on Yarn configuration
• enterprise-gateway: Adds Anaconda and Jupyter Enterprise Gateway to the yarn-spark image
• nb2kg: Minimal Jupyter Notebook client configured with hooks to access the Enterprise Gateway
• https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker
Building the latest Docker images
• git clone https://github.com/jupyter-incubator/enterprise_gateway
• cd enterprise_gateway
• make docker-clean docker-images
Note: Make also has targets to clean and build individual images (type make for help)

Connecting to a Spark Cluster using a docker image
docker run -t --rm \
  -e KG_URL='http://<Enterprise Gateway IP>:8888' \
  -p 8888:8888 \
  -e VALIDATE_KG_CERT='no' \
  -e LOG_LEVEL=DEBUG \
  -e KG_REQUEST_TIMEOUT=40 \
  -e KG_CONNECT_TIMEOUT=40 \
  -v ${HOME}/opensource/jupyter/jupyter-notebooks/:/tmp/notebooks \
  -w /tmp/notebooks \
  elyra/nb2kg:dev
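To show what a client such as NB2KG does with `KG_URL`, the sketch below builds, without sending, the two requests a client typically needs against a Jupyter kernel REST API: a `POST` to `/api/kernels` to start a kernel, and a WebSocket URL under `/api/kernels/<id>/channels` to talk to it. The host `gateway.example.com`, the kernel ID `1234`, and both helper names are placeholders, and this is not NB2KG's actual code.

```python
import json
from urllib.parse import urljoin, urlsplit, urlunsplit
from urllib.request import Request


def kernel_start_request(kg_url, kernel_name):
    """Build (but do not send) the POST that asks the gateway to start a kernel."""
    body = json.dumps({"name": kernel_name}).encode("utf-8")
    return Request(
        urljoin(kg_url.rstrip("/") + "/", "api/kernels"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def channels_url(kg_url, kernel_id):
    """Derive the WebSocket URL used to exchange messages with a running kernel."""
    parts = urlsplit(kg_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    path = parts.path.rstrip("/") + f"/api/kernels/{kernel_id}/channels"
    return urlunsplit((scheme, parts.netloc, path, "", ""))


# Placeholder gateway address and kernel ID, for illustration only.
request = kernel_start_request("http://gateway.example.com:8888", "python3")
print(request.get_method(), request.full_url)
print(channels_url("http://gateway.example.com:8888", "1234"))
```

Because the client only needs HTTP and WebSocket access to the gateway, the notebook server can run on a laptop or in a container while the kernels run inside the cluster.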

Jupyter Enterprise Gateway at IBM Code
https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/
Jupyter Enterprise Gateway source code at GitHub
https://github.com/jupyter-incubator/enterprise_gateway
Jupyter Enterprise Gateway Documentation
http://jupyter-enterprise-gateway.readthedocs.io/en/latest/
Jupyter Enterprise Gateway 0.7 release just came out on Nov 20th
https://groups.google.com/forum/#!topic/jupyter/DzjvuCHwPwo
Free IBM Data Science trial
https://ibm.biz/Bdj9xP
