Collaborative data science and how to build a
data science toolchain around
notebook technologies
Creator of Apache Zeppelin
Co-Founder, CTO
Moon soo Lee
moon@zepl.com
#ODSC 2018
Who am I
A big believer that data science notebooks change how people collaborate
Creator of Apache Zeppelin
Co-founder
https://github.com/Leemoonsoo
www.zepl.com
#ODSC 2018
It was 2013, and I really wanted to have an
interactive analytics interface for [name missing in the source slide].
#ODSC 2018
Started an open source project,
Zeppelin (http://zeppelin-project.org/),
a data science notebook.
Became an Apache project in 2016.
http://zeppelin.apache.org
#ODSC 2018
Iterations REPL interface (2013)
Editor / Result interface (2013)
Notebook interface (2014)
#ODSC 2018
Zeppelin
Multi-language in a notebook
Python, R, Scala, SQL, ...
Plugin visualization and online repository
Authentication
LDAP, AD, ...
Authorization
Notebook access control
Built-in notebook scheduler
#ODSC 2018
Pilot to production in 1 day
Data scientist: "Hey, take a look"
Business: "I need an update every morning!"
#ODSC 2018
More notebook consumers than producers
#ODSC 2018
Realized that notebook is a great collaboration tool
Why notebook?
#ODSC 2018
Notebook is
- Interactive
- Flexible
- Visualized
- Inline description
- Contains a story
- Shareable
#ODSC 2018
How to build a collaborative environment
with notebook technology
#ODSC 2018
Notebook sharing
Between you and: data scientist, data engineer, data analyst, marketing, SW engineer, sales, executive
#ODSC 2018
You’re using only half of its
potential if not sharing
#ODSC 2018
Ways to share notebooks
VCS: GitHub
Open source: nbviewer, Zeppelin, Airbnb/knowledge-repo
Service: commercial services for notebook sharing
#ODSC 2018
GitHub
● Store notebooks in GitHub
● Versioning
● GitHub provides an .ipynb viewer
● Fork / pull request / merge
● Private / Public / Team / Org
● Hard to apply Notebook level ACL
● Not easy for Non-engineers
#ODSC 2018
nbviewer
● Publishing notebooks
● Share a notebook by sharing a link
● Easy to use
● No access control
nbconvert (rendering .ipynb to static HTML) as a web service
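To make "rendering .ipynb to static HTML" concrete, here is a minimal sketch that walks the cells of a notebook's JSON and emits a bare HTML page. It uses only the Python standard library and is a simplified stand-in, not how nbconvert or nbviewer are implemented. The function name `render_notebook_html` and the sample notebook are hypothetical.

```python
import html
import json


def render_notebook_html(nb_json: str) -> str:
    """Render the cells of an nbformat-4 style notebook as minimal static HTML."""
    nb = json.loads(nb_json)
    parts = ["<html><body>"]
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # The notebook format stores source as a string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        escaped = html.escape(source)
        if cell.get("cell_type") == "code":
            parts.append(f"<pre><code>{escaped}</code></pre>")
        else:
            # Markdown is shown as plain text here; a real converter renders it.
            parts.append(f"<p>{escaped}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


# Hypothetical two-cell notebook used only for illustration.
SAMPLE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Daily report"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["print(1 < 2)"]},
    ],
})

if __name__ == "__main__":
    print(render_notebook_html(SAMPLE_NOTEBOOK))
```

Because the result is a static page, anyone with the link can read it, which matches the "no access control" point above.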
#ODSC 2018
Airbnb/knowledge-repo
https://github.com/airbnb/knowledge-repo
● .ipynb, md as a post
● Git repo for version control
● Feeds
● Search
● No access control
#ODSC 2018
Apache Zeppelin
● Share notebooks with ACL: Read/Write/Execute
● Jupyter notebooks need to be converted from .ipynb to the Zeppelin format
on the command line.
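A rough sketch of what such a conversion involves, assuming a Zeppelin note is a JSON document with a `name` and a list of `paragraphs` that each carry a `text` field. Real notes hold more fields (ids, config, results), and this is not the converter referred to above. The interpreter prefixes `%python` and `%md` are assumptions about how the target Zeppelin instance is configured, and `ipynb_to_zeppelin_note` is a hypothetical name.

```python
import json


def ipynb_to_zeppelin_note(nb_json: str, name: str) -> dict:
    """Map notebook cells to simplified Zeppelin-style paragraphs."""
    nb = json.loads(nb_json)
    paragraphs = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        # Code cells go to the Python interpreter, everything else to markdown.
        prefix = "%python" if cell.get("cell_type") == "code" else "%md"
        paragraphs.append({"text": f"{prefix}\n{source}"})
    return {"name": name, "paragraphs": paragraphs}


# Hypothetical input notebook used only for illustration.
SAMPLE_IPYNB = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["x = 1\n", "print(x)"]},
    ],
})

if __name__ == "__main__":
    note = ipynb_to_zeppelin_note(SAMPLE_IPYNB, "converted-note")
    print(json.dumps(note, indent=2))
```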
#ODSC 2018
Commercial services for notebook sharing
Google Colab
● Share notebooks through Google Drive
● View/Edit/Run .ipynb notebook using Colab
● Realtime collaboration
ZEPL
● Notebook level ACL
● View/Edit/Run .ipynb and Zeppelin notebook
● Realtime collaboration
● Import existing notebook from git/s3 storage
www.zepl.com
#ODSC 2018
DON’Ts
● Email attachments
● Sending files directly
● Sharing through USB drives
● ...
Email attachment / Local copy on a laptop / USB drive
#ODSC 2018
DO’s
● Provide access to the same dataset
● Access control capability
● Horizontal scalability
#ODSC 2018
Data catalog
● Provides the location of data, what it means, and how to load it
○ e.g.
    Dataset   Location               Schema                                        Note
    Activity  s3://service/activity  Date (DateTime), type (INT), action (String)  Type is either RUN or STOP. ….
    Images    s3://service/images    512x256 pixel images                          Images are collected from profile photo...
● The catalog needs to be accessible / searchable / annotatable
● There are many different ways to build one, depending on team / infrastructure
○ Hive Metastore as a data catalog
○ Cloud infrastructure service (e.g. AWS Glue Data Catalog, Azure Data Catalog)
○ Data catalog / publishing software (e.g. CKAN, DKAN)
○ Custom built on top of an RDBMS, NoSQL store, or indexing engine
○ Build a data catalog using a notebook
#ODSC 2018
Build data catalog using Notebook
● Flexible enough to describe data
● Searchable, shareable, annotatable
● Programmatic generation
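"Programmatic generation" could look like the sketch below: catalog entries are kept as plain records, filtered by keyword, and printed in Zeppelin's `%table` display format (tab-separated columns, one row per line). The two entries are the examples from the data catalog slide. `CatalogEntry`, `search`, and `to_zeppelin_table` are hypothetical names, and a real catalog would load its entries from a metastore or another source rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CatalogEntry:
    dataset: str
    location: str
    schema: str
    note: str


# Example entries taken from the data catalog slide.
CATALOG = [
    CatalogEntry("Activity", "s3://service/activity",
                 "Date (DateTime), type (INT), action (String)",
                 "Type is either RUN or STOP."),
    CatalogEntry("Images", "s3://service/images",
                 "512x256 pixel images",
                 "Images are collected from profile photo..."),
]


def search(entries: List[CatalogEntry], keyword: str) -> List[CatalogEntry]:
    """Return entries where any field contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        e for e in entries
        if any(needle in field.lower()
               for field in (e.dataset, e.location, e.schema, e.note))
    ]


def to_zeppelin_table(entries: List[CatalogEntry]) -> str:
    """Format entries for Zeppelin's %table display system."""
    rows = ["Dataset\tLocation\tSchema\tNote"]
    rows += ["\t".join((e.dataset, e.location, e.schema, e.note)) for e in entries]
    return "%table " + "\n".join(rows)


if __name__ == "__main__":
    # In a Zeppelin paragraph, printing this string renders a table.
    print(to_zeppelin_table(search(CATALOG, "s3://")))
```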
#ODSC 2018
Multi-user environment
#ODSC 2018
I like my notebook running on my laptop.
No you don’t.
#ODSC 2018
Sign in and Run
Install libraries and
Install notebook and
Configure driver, environments and
Request access to data and
Setup access to notebook repo and
….
Run
#ODSC 2018
[Diagram: JupyterHub] Reverse proxy routes /hub to JupyterHub (Authenticator: LDAP, OAuth, etc.; Spawner: Docker, k8s) and /user/[name] to per-user Jupyter servers, each with its own kernel (Python, R). Notebook storage (filesystem, Git, etc.).
[Diagram: Zeppelin] Zeppelin Server with Auth / ACL (LDAP, OAuth, etc.) and Interpreter Manager, which manages multiple interpreters (kernels). Notebook storage (filesystem, Git, etc.).
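The routing in the JupyterHub layout can be sketched as a small function: requests under /user/[name] go to that user's own server when one is running, and everything else, including users without a server yet, goes to the hub, which authenticates and spawns. This is an illustrative model only, not JupyterHub's proxy code. `route` and `running_users` are hypothetical names.

```python
from typing import Set


def route(path: str, running_users: Set[str]) -> str:
    """Return which backend a reverse proxy would send `path` to."""
    segments = [s for s in path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "user":
        name = segments[1]
        if name in running_users:
            return f"single-user server for {name}"
        # No server yet: the hub authenticates the user and spawns one.
        return "hub"
    # /hub and any unknown path fall through to the hub.
    return "hub"


if __name__ == "__main__":
    running = {"alice"}
    for p in ("/hub/login", "/user/alice/tree", "/user/bob/tree"):
        print(p, "->", route(p, running))
```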
#ODSC 2018
Single-user notebook servers behind a reverse proxy
[Diagram: Browser, reverse proxy, one single-user notebook server and kernel per user, shared notebook storage]
● Easier to implement / manage
● Notebook sharing is decoupled from the execution environment
● e.g.
○ JupyterHub
○ AWS SageMaker
Multi-user notebook server
[Diagram: Browser, one multi-user notebook server with several kernels, notebook storage]
● More complex to implement / manage
● Notebook sharing is coupled with the execution environment, so a more
integrated sharing environment can be expected
● e.g.
○ Apache Zeppelin
○ ZEPL
○ Google Colab
#ODSC 2018
Reproducibility on notebook
1. Configure environment
a. %env, %python.config, %spark.config
2. Install libraries
a. !pip install, %spark.dep
3. Load data
4. Your work
5. Print libraries
a. !pip list, %conda list
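For step 5, a notebook cell can also record library versions from Python itself. The sketch below uses the standard library's `importlib.metadata` (Python 3.8+) and is one possible alternative to `!pip list`, not a replacement mandated by any notebook tool. `installed_libraries` is a hypothetical helper name.

```python
from importlib import metadata
from typing import List


def installed_libraries() -> List[str]:
    """Return sorted 'name==version' strings for installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Put this in the last cell so the saved output documents the environment.
    for pin in installed_libraries():
        print(pin)
```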
#ODSC 2018
Notebook to production
Built-in scheduler: Zeppelin, ZEPL
External scheduler: REST API
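An external scheduler (cron, a workflow tool, etc.) can trigger a notebook run over HTTP. The sketch below builds such a request, assuming Zeppelin's REST endpoint for running all paragraphs of a note (`POST /api/notebook/job/[noteId]`); check the REST API documentation for the Zeppelin version in use. The base URL and note id are made up, authentication is left out, and `build_run_request` is a hypothetical helper.

```python
import urllib.request


def build_run_request(base_url: str, note_id: str) -> urllib.request.Request:
    """Build a POST request that asks the server to run every paragraph of a note."""
    url = f"{base_url.rstrip('/')}/api/notebook/job/{note_id}"
    return urllib.request.Request(url, data=b"", method="POST")


if __name__ == "__main__":
    # Hypothetical server address and note id, for illustration only.
    req = build_run_request("http://localhost:8080", "2A94M5J1Z")
    print(req.get_method(), req.full_url)
    # To actually trigger the run against a live server:
    # urllib.request.urlopen(req)
```

Keeping the request construction separate from sending it makes the call easy to wrap in whatever job definition the external scheduler expects.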
#ODSC 2018
Notebook to production
Rewrite :) and submit
In C/C++, Python, Scala ...
Export, submit notebook as an application
- Run notebook on the command line
- Export notebook as a Spark application
- https://github.com/CODAIT/notebook-exporter/tree/master/notebook-exporter
Data pipeline
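The "export" path can be approximated with a few lines that pull the code cells out of an .ipynb file into a plain script that a pipeline can submit. This is a simplified sketch using only the standard library, not the exporter linked above. It crudely drops every line that starts with `%` or `!` (magics and shell escapes), and `notebook_to_script` and the sample notebook are hypothetical.

```python
import json


def notebook_to_script(nb_json: str) -> str:
    """Concatenate the code cells of a notebook into a plain Python script."""
    nb = json.loads(nb_json)
    chunks = []
    for index, cell in enumerate(nb.get("cells", []), start=1):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        # Drop magics and shell escapes, which plain Python cannot run.
        lines = [ln for ln in source.splitlines()
                 if not ln.lstrip().startswith(("%", "!"))]
        chunks.append(f"# cell {index}\n" + "\n".join(lines))
    return "\n\n".join(chunks) + "\n"


# Hypothetical notebook used only for illustration.
SAMPLE_PIPELINE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Nightly job"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["!pip list\n", "total = 1 + 2\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["print(total)"]},
    ],
})

if __name__ == "__main__":
    print(notebook_to_script(SAMPLE_PIPELINE_NOTEBOOK))
```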
#ODSC 2018
Conclusion
● Share notebooks
● Share data
● Multi-user environment
Together, these enable collaboration.
Things to consider
● Reproducibility
● Notebook to production
#ODSC 2018
Thanks

Weitere ähnliche Inhalte

Was ist angesagt?

Why UI Developers Love GraphQL - Sashko Stubailo, Apollo/Meteor
Why UI Developers Love GraphQL - Sashko Stubailo, Apollo/MeteorWhy UI Developers Love GraphQL - Sashko Stubailo, Apollo/Meteor
Why UI Developers Love GraphQL - Sashko Stubailo, Apollo/MeteorJon Wong
 
GraphQL: Enabling a new generation of API developer tools
GraphQL: Enabling a new generation of API developer toolsGraphQL: Enabling a new generation of API developer tools
GraphQL: Enabling a new generation of API developer toolsSashko Stubailo
 
GraphQL: The Missing Link Between Frontend and Backend Devs
GraphQL: The Missing Link Between Frontend and Backend DevsGraphQL: The Missing Link Between Frontend and Backend Devs
GraphQL: The Missing Link Between Frontend and Backend DevsSashko Stubailo
 
The Apollo and GraphQL Stack
The Apollo and GraphQL StackThe Apollo and GraphQL Stack
The Apollo and GraphQL StackSashko Stubailo
 
Machine learning on kubernetes
Machine learning on kubernetesMachine learning on kubernetes
Machine learning on kubernetesAnirudh Ramanathan
 
Taking Control of your Data with GraphQL
Taking Control of your Data with GraphQLTaking Control of your Data with GraphQL
Taking Control of your Data with GraphQLVinci Rufus
 
Realizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamRealizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamJ On The Beach
 
Hydrosphere.io for ODSC: Webinar on Kubeflow
Hydrosphere.io for ODSC: Webinar on KubeflowHydrosphere.io for ODSC: Webinar on Kubeflow
Hydrosphere.io for ODSC: Webinar on KubeflowRustem Zakiev
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...Databricks
 
Apache Airflow at Dailymotion
Apache Airflow at DailymotionApache Airflow at Dailymotion
Apache Airflow at DailymotionGermain Tanguy
 
Building Notebook-based AI Pipelines with Elyra and Kubeflow
Building Notebook-based AI Pipelines with Elyra and KubeflowBuilding Notebook-based AI Pipelines with Elyra and Kubeflow
Building Notebook-based AI Pipelines with Elyra and KubeflowDatabricks
 
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Malo Denielou
 
GraphQL across the stack: How everything fits together
GraphQL across the stack: How everything fits togetherGraphQL across the stack: How everything fits together
GraphQL across the stack: How everything fits togetherSashko Stubailo
 
Building Applications with Streams and Snapshots
Building Applications with Streams and SnapshotsBuilding Applications with Streams and Snapshots
Building Applications with Streams and SnapshotsJ On The Beach
 
GraphQL over REST at Reactathon 2018
GraphQL over REST at Reactathon 2018GraphQL over REST at Reactathon 2018
GraphQL over REST at Reactathon 2018Sashko Stubailo
 
Introduction to GraphQL
Introduction to GraphQLIntroduction to GraphQL
Introduction to GraphQLBrainhub
 
An intro to GraphQL
An intro to GraphQLAn intro to GraphQL
An intro to GraphQLvaluebound
 
DeNA West & BigQuery
DeNA West & BigQueryDeNA West & BigQuery
DeNA West & BigQueryYoshi Izawa
 
How to GraphQL: React Apollo
How to GraphQL: React ApolloHow to GraphQL: React Apollo
How to GraphQL: React ApolloTomasz Bak
 

Was ist angesagt? (20)

Why UI Developers Love GraphQL - Sashko Stubailo, Apollo/Meteor
Why UI Developers Love GraphQL - Sashko Stubailo, Apollo/MeteorWhy UI Developers Love GraphQL - Sashko Stubailo, Apollo/Meteor
Why UI Developers Love GraphQL - Sashko Stubailo, Apollo/Meteor
 
GraphQL: Enabling a new generation of API developer tools
GraphQL: Enabling a new generation of API developer toolsGraphQL: Enabling a new generation of API developer tools
GraphQL: Enabling a new generation of API developer tools
 
GraphQL: The Missing Link Between Frontend and Backend Devs
GraphQL: The Missing Link Between Frontend and Backend DevsGraphQL: The Missing Link Between Frontend and Backend Devs
GraphQL: The Missing Link Between Frontend and Backend Devs
 
The Apollo and GraphQL Stack
The Apollo and GraphQL StackThe Apollo and GraphQL Stack
The Apollo and GraphQL Stack
 
Machine learning on kubernetes
Machine learning on kubernetesMachine learning on kubernetes
Machine learning on kubernetes
 
Taking Control of your Data with GraphQL
Taking Control of your Data with GraphQLTaking Control of your Data with GraphQL
Taking Control of your Data with GraphQL
 
Realizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamRealizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache Beam
 
Hydrosphere.io for ODSC: Webinar on Kubeflow
Hydrosphere.io for ODSC: Webinar on KubeflowHydrosphere.io for ODSC: Webinar on Kubeflow
Hydrosphere.io for ODSC: Webinar on Kubeflow
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 
Apache Airflow at Dailymotion
Apache Airflow at DailymotionApache Airflow at Dailymotion
Apache Airflow at Dailymotion
 
Building Notebook-based AI Pipelines with Elyra and Kubeflow
Building Notebook-based AI Pipelines with Elyra and KubeflowBuilding Notebook-based AI Pipelines with Elyra and Kubeflow
Building Notebook-based AI Pipelines with Elyra and Kubeflow
 
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
 
GraphQL across the stack: How everything fits together
GraphQL across the stack: How everything fits togetherGraphQL across the stack: How everything fits together
GraphQL across the stack: How everything fits together
 
Building Applications with Streams and Snapshots
Building Applications with Streams and SnapshotsBuilding Applications with Streams and Snapshots
Building Applications with Streams and Snapshots
 
GraphQL over REST at Reactathon 2018
GraphQL over REST at Reactathon 2018GraphQL over REST at Reactathon 2018
GraphQL over REST at Reactathon 2018
 
Introduction to GraphQL
Introduction to GraphQLIntroduction to GraphQL
Introduction to GraphQL
 
An intro to GraphQL
An intro to GraphQLAn intro to GraphQL
An intro to GraphQL
 
DeNA West & BigQuery
DeNA West & BigQueryDeNA West & BigQuery
DeNA West & BigQuery
 
How to GraphQL: React Apollo
How to GraphQL: React ApolloHow to GraphQL: React Apollo
How to GraphQL: React Apollo
 

Ähnlich wie Collaborative data science and how to build a data science toolchain around notebook technologies odsc 2018 boston (1)

AirBNB's ML platform - BigHead
AirBNB's ML platform - BigHeadAirBNB's ML platform - BigHead
AirBNB's ML platform - BigHeadKarthik Murugesan
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding HadoopAhmed Ossama
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On DemandBogdan Kyryliuk
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltreMarco Parenzan
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
 
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdf
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdfPrefect Paris Airflow Meetup Jeff Hale April 2023.pdf
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdfJeff Hale
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Kaxil Naik
 
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud RunDesigning flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Runwesley chun
 
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CDA GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CDJulian Mazzitelli
 
Instant developer onboarding with self contained repositories
Instant developer onboarding with self contained repositoriesInstant developer onboarding with self contained repositories
Instant developer onboarding with self contained repositoriesYshay Yaacobi
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Waysmalltown
 
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdfLupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdfWolfgangZiegler6
 
Openstack India May Meetup
Openstack India May MeetupOpenstack India May Meetup
Openstack India May MeetupDeepak Garg
 
Why we chose Argo Workflow to scale DevOps at InVision
Why we chose Argo Workflow to scale DevOps at InVisionWhy we chose Argo Workflow to scale DevOps at InVision
Why we chose Argo Workflow to scale DevOps at InVisionNebulaworks
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudTreasure Data, Inc.
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudEduardo Silva Pereira
 
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...it-people
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bigheadChester Chen
 
DocDoku: Using web technologies in a desktop application. OW2con'15, November...
DocDoku: Using web technologies in a desktop application. OW2con'15, November...DocDoku: Using web technologies in a desktop application. OW2con'15, November...
DocDoku: Using web technologies in a desktop application. OW2con'15, November...OW2
 

Ähnlich wie Collaborative data science and how to build a data science toolchain around notebook technologies odsc 2018 boston (1) (20)

AirBNB's ML platform - BigHead
AirBNB's ML platform - BigHeadAirBNB's ML platform - BigHead
AirBNB's ML platform - BigHead
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On Demand
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdf
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdfPrefect Paris Airflow Meetup Jeff Hale April 2023.pdf
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdf
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
 
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud RunDesigning flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
 
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CDA GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
 
Instant developer onboarding with self contained repositories
Instant developer onboarding with self contained repositoriesInstant developer onboarding with self contained repositories
Instant developer onboarding with self contained repositories
 
Fluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at ScaleFluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at Scale
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
 
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdfLupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
 
Openstack India May Meetup
Openstack India May MeetupOpenstack India May Meetup
Openstack India May Meetup
 
Why we chose Argo Workflow to scale DevOps at InVision
Why we chose Argo Workflow to scale DevOps at InVisionWhy we chose Argo Workflow to scale DevOps at InVision
Why we chose Argo Workflow to scale DevOps at InVision
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the Cloud
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the Cloud
 
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
«Что такое serverless-архитектура и как с ней жить?» Николай Марков, Aligned ...
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bighead
 
DocDoku: Using web technologies in a desktop application. OW2con'15, November...
DocDoku: Using web technologies in a desktop application. OW2con'15, November...DocDoku: Using web technologies in a desktop application. OW2con'15, November...
DocDoku: Using web technologies in a desktop application. OW2con'15, November...
 

Kürzlich hochgeladen

Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 
Forming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptForming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptNoman khan
 
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmDeepika Walanjkar
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHSneha Padhiar
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSneha Padhiar
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书rnrncn29
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
A brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProA brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProRay Yuan Liu
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosVictor Morales
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxStephen Sitton
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdfAkritiPradhan2
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfDrew Moseley
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 

Kürzlich hochgeladen (20)

Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Community
 
Forming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptForming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).ppt
 
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
 
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
A brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProA brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision Pro
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitos
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptx
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 

Collaborative data science and how to build a data science toolchain around notebook technologies odsc 2018 boston (1)

  • 1. Collaborative data science and build data science tool chain around Notebook technologies Creator of Apache Zeppelin Co-Founder, CTO Moon soo Lee moon@zepl.com
  • 2. #ODSC 2018 Who am I A big believer that data science notebook changes how people collaborate Creator of Apache Zeppelin Co-founder https://github.com/Leemoonsoo www.zepl.com
  • 3. #ODSC 2018 It was 2013, really wanted to have interactive analytics interface for .
  • 4. #ODSC 2018 Started an opensource project - Zeppelin http://zeppelin-project.org/ data science notebook. Became an project in 2016. http://zeppelin.apache.org
  • 5. #ODSC 2018 Iterations REPL interface (2013) Editor / Result interface (2013) Notebook interface (2014)
  • 6. Zeppelin ● Multi-language in a notebook: Python, R, Scala, SQL, ... ● Plugin visualization and online repository ● Authentication: LDAP, AD, ... ● Authorization: notebook access control ● Built-in notebook scheduler
  • 7. Pilot to Production in 1 day: "Hey, take a look" / "I need an update every morning!" (Data scientist, Business)
  • 8. More notebook consumers than producers
  • 9. Realized that the notebook is a great collaboration tool. Why notebook?
  • 10. Notebook is - Interactive - Flexible - Visualized - Described inline - A container for a story - Shareable
  • 11. How to build a collaborative environment with notebook technology
  • 12. Notebook Sharing: You, Data scientist, Data engineer, Data Analyst, Marketing, SW engineer, Sales, Executive
  • 13. You’re using only half of a notebook’s potential if you’re not sharing it
  • 15. Github ● Store notebook in Github ● Versioning ● Github provides .ipynb viewer ● Fork / pull request / merge ● Private / Public / Team / Org ● Hard to apply notebook-level ACL ● Not easy for non-engineers
  • 16. nbviewer ● Publishing notebook ● Share notebook by sharing link ● Easy to use ● No access control ● nbconvert (rendering .ipynb to static HTML) as a web service
  • 17. Airbnb/knowledge-repo https://github.com/airbnb/knowledge-repo ● .ipynb, md as a post ● Git repo for version control ● Feeds ● Search ● No access control
  • 18. Apache Zeppelin ● Share notebook with ACL: Read/Write/Execute ● A Jupyter notebook must first be converted from .ipynb to Zeppelin format on the command line.
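A notebook-level ACL with Read/Write/Execute permissions could be modeled roughly as below. This is only an illustrative sketch: the `NoteAcl` class, its field names, and the user names are hypothetical and are not Zeppelin's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class NoteAcl:
    """Hypothetical per-notebook access list (not Zeppelin's real model)."""
    readers: set = field(default_factory=set)
    writers: set = field(default_factory=set)
    runners: set = field(default_factory=set)

    def allows(self, user: str, action: str) -> bool:
        # Map each action name to the set of users granted that action.
        granted = {
            "read": self.readers,
            "write": self.writers,
            "execute": self.runners,
        }
        # Unknown actions are denied by default.
        return user in granted.get(action, set())


# A consumer-heavy team: many readers, few writers and runners.
acl = NoteAcl(
    readers={"sales", "analyst"},
    writers={"analyst"},
    runners={"analyst"},
)
print(acl.allows("sales", "read"), acl.allows("sales", "write"))
```

Keeping the three permissions separate lets a team grant read access widely, which fits the observation that notebooks have more consumers than producers.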
  • 19. Commercial services for notebook sharing. Google Colab ● Share notebook through Google Drive ● View/Edit/Run .ipynb notebook using Colab ● Realtime collaboration. ZEPL (www.zepl.com) ● Notebook-level ACL ● View/Edit/Run .ipynb and Zeppelin notebook ● Realtime collaboration ● Import existing notebook from git/s3 storage
  • 21. DON’Ts ● Email attachment ● Direct send ● Share through USB ● ... (diagram: email attachment, local copy on laptop, USB drive)
  • 22. DO’s ● Provide access to the same dataset ● Access control capability ● Horizontal scalability
  • 23. Data catalog ● Provides the location of data, what it means, and how to load it ○ e.g. (columns: Dataset | Location | Schema | Note) ○ Activity | s3://service/activity | Date (DateTime), type (INT), action (String) | Type is either RUN or STOP. …. ○ Images | s3://service/images | 512x256 pixel images | Images are collected from profile photo... ● Catalog needs to be accessible / searchable / annotatable ● Many different ways to build one, depending on team / infra ○ Hive Metastore as a data catalog ○ Cloud infrastructure service (e.g. AWS Glue Data Catalog, Azure Data Catalog) ○ Data catalog / publishing software (e.g. CKAN, DKAN) ○ Custom built on top of RDBMS, NoSQL, indexing engine ○ Build data catalog using notebook
  • 24. Build data catalog using Notebook ● Flexible enough to describe data ● Searchable, shareable, annotatable ● Programmatic generation
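Programmatic generation of a catalog inside a notebook cell might look like the sketch below. The two entries are taken from the example table on the data catalog slide; the `Dataset` class and the `search` and `render` helpers are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    location: str
    schema: str
    note: str


# Entries copied from the example table on the data catalog slide.
CATALOG = [
    Dataset("Activity", "s3://service/activity",
            "Date (DateTime), type (INT), action (String)",
            "Type is either RUN or STOP. …."),
    Dataset("Images", "s3://service/images",
            "512x256 pixel images",
            "Images are collected from profile photo..."),
]


def search(catalog, keyword):
    """Return datasets whose name, schema, or note mentions the keyword."""
    needle = keyword.lower()
    return [d for d in catalog
            if needle in f"{d.name} {d.schema} {d.note}".lower()]


def render(catalog):
    """Render the catalog as plain text, one block per dataset."""
    blocks = []
    for d in catalog:
        blocks.append(
            f"{d.name}\n  location: {d.location}\n"
            f"  schema: {d.schema}\n  note: {d.note}"
        )
    return "\n\n".join(blocks)


print(render(search(CATALOG, "pixel")))
```

In practice the list would be generated from a metastore or a bucket listing rather than typed by hand; the notebook then serves as the searchable, annotatable view.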
  • 26. "I like my notebook running on my laptop." "No you don’t."
  • 27. Sign in and Run, versus: Install libraries and Install notebook and Configure driver, environments and Request access to data and Setup access to notebook repo and …. Run
  • 28. [Architecture diagrams] JupyterHub: Reverse Proxy routes /hub to JupyterHub (Authenticator: LDAP, OAuth, etc; Spawner: Docker, k8s) and /user/[name] to per-user Jupyter servers, each with a Kernel (Python, R); Notebook Storage (Filesystem, Git, etc). Zeppelin: Zeppelin Server with Auth / ACL (LDAP, OAuth, etc), Notebook Storage (Filesystem, Git, etc), and an Interpreter Manager that drives multiple Interpreters (kernels)
  • 29. Single-user notebook servers behind a reverse proxy (Browser, Reverse Proxy, Single user Notebook server with Kernel, Notebook Storage) ● Easier to implement / manage ● Notebook sharing is decoupled from the execution environment ● e.g. ○ JupyterHub ○ AWS Sagemaker. Multi-user notebook server (Browser, Multi user Notebook server, Notebook Storage, multiple Kernels) ● More complex to implement / manage ● Notebook sharing is coupled with the execution environment. Can expect a more integrated sharing environment. ● e.g. ○ Apache Zeppelin ○ ZEPL ○ Google Colab
  • 30. Reproducibility on notebook 1. Configure environment (%env, %python.config, %spark.config) 2. Install libraries (!pip install, %spark.dep) 3. Load data 4. Your work 5. Print libraries (!pip list, %conda list)
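The final step, printing libraries, can also be done from plain Python when `!pip list` is not available in the kernel. The following is a minimal sketch using only the standard library; the `environment_report` function name is an assumption introduced here, not something from the talk.

```python
import platform
import sys
from importlib import metadata


def environment_report():
    """Collect interpreter details and installed package versions."""
    packages = sorted(
        (str(dist.metadata["Name"]), str(dist.version))
        for dist in metadata.distributions()
    )
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": packages,
    }


report = environment_report()
print(report["python"], report["platform"])
for name, version in report["packages"]:
    print(f"{name}=={version}")
```

Running this as the last cell leaves a record of the exact environment next to the results, so a reader can rebuild it later.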
  • 31. Notebook to production ● Built-in scheduler: Zeppelin, zepl ● External scheduler: REST api
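An external scheduler (cron, Airflow, or similar) can trigger a notebook through the REST API. The sketch below targets Zeppelin's documented "run all paragraphs" endpoint, `POST /api/notebook/job/[noteId]`; check the path against the REST API documentation for your Zeppelin version. The host, port, and note ID are placeholders, and authentication is omitted.

```python
from urllib.request import Request, urlopen


def build_run_request(base_url: str, note_id: str) -> Request:
    """Build a POST to the "run all paragraphs" endpoint for one note."""
    url = f"{base_url.rstrip('/')}/api/notebook/job/{note_id}"
    return Request(url, data=b"", method="POST")


def run_note(base_url: str, note_id: str, timeout: float = 30) -> int:
    """Send the request and return the HTTP status code."""
    request = build_run_request(base_url, note_id)
    with urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder server address and note ID, for illustration only.
request = build_run_request("http://localhost:8080", "2ABCDEFGH")
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the scheduling logic testable without a running server; only `run_note` touches the network.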
  • 32. Notebook to production ● Rewrite :) and submit, in C/C++, Python, Scala ... ● Export / submit notebook as an application - Run notebook in command line - Export notebook as a Spark application - https://github.com/CODAIT/notebook-exporter/tree/master/notebook-exporter ● Data pipeline
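To illustrate the export idea in its simplest form, the sketch below concatenates the code cells of a Jupyter notebook (nbformat 4 JSON) into a plain Python script. It is not how the linked notebook-exporter project works; it is a stand-in that assumes the notebook contains only Python code plus IPython magics or shell escapes, which it drops.

```python
import json


def notebook_to_script(notebook_json: str) -> str:
    """Concatenate the code cells of an nbformat 4 notebook into a script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat allows source as a single string or a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        # Drop IPython magics and shell escapes; they are not plain Python.
        lines = [ln for ln in text.splitlines()
                 if not ln.lstrip().startswith(("%", "!"))]
        if lines:
            chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


# A tiny in-memory notebook used only for demonstration.
example = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Report"]},
        {"cell_type": "code", "source": ["!pip list\n", "total = 1 + 2\n"]},
        {"cell_type": "code", "source": "print(total)"},
    ],
})
print(notebook_to_script(example))
```

The resulting script can be submitted to a scheduler or pipeline like any other program, at the cost of losing the notebook's inline narrative and outputs.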
  • 33. Conclusion ● Share notebook ● Share data ● Multi-user environment (together these enable collaboration). Things to consider ● Reproducibility ● Notebook to production