Collaborative data science and how to build a data science toolchain around notebook technologies (ODSC 2018, Boston)
1. Collaborative data science and
how to build a data science toolchain around
notebook technologies
Creator of Apache Zeppelin
Co-Founder, CTO
Moon soo Lee
moon@zepl.com
2. #ODSC 2018
Who am I
A big believer that data science notebooks change how people collaborate
Creator of Apache Zeppelin
Co-founder
https://github.com/Leemoonsoo
www.zepl.com
3. #ODSC 2018
It was 2013, really wanted to have
interactive analytics interface for .
4. #ODSC 2018
Started an open source project,
Zeppelin (http://zeppelin-project.org/),
a data science notebook.
Became an Apache top-level project in 2016.
http://zeppelin.apache.org
15. #ODSC 2018
GitHub
● Store notebooks in GitHub
● Versioning
● GitHub provides an .ipynb viewer
● Fork / pull request / merge
● Private / Public / Team / Org
● Hard to apply notebook-level ACL
● Not easy for non-engineers
16. #ODSC 2018
nbviewer
● Publishing notebooks
● Share a notebook by sharing a link
● Easy to use
● No access control
nbconvert (rendering .ipynb to static HTML) as a web service
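Sharing by link can be sketched as building an nbviewer URL for a notebook stored in a public GitHub repository. This is only a sketch: the `nbviewer_url` helper and the owner / repo / path values are hypothetical, and the `https://nbviewer.org/github/<owner>/<repo>/blob/<ref>/<path>` pattern is assumed from how the public nbviewer service commonly exposes GitHub-hosted notebooks.

```python
from urllib.parse import quote


def nbviewer_url(owner: str, repo: str, path: str, ref: str = "master") -> str:
    """Build a shareable nbviewer link for a notebook hosted on GitHub.

    Assumption: the public nbviewer service accepts URLs of the form
    /github/<owner>/<repo>/blob/<ref>/<path>. The helper name and all
    arguments are illustrative, not part of any official API.
    """
    # Percent-encode the path (spaces etc.) but keep "/" separators intact.
    return f"https://nbviewer.org/github/{owner}/{repo}/blob/{ref}/{quote(path)}"


if __name__ == "__main__":
    # Hypothetical repository and notebook path.
    print(nbviewer_url("someone", "demo", "notebooks/intro.ipynb"))
```

Anyone with the link can read the rendered notebook, which is why the slide lists "No access control".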
18. #ODSC 2018
Apache Zeppelin
● Share notebooks with ACL: Read / Write / Execute
● Jupyter notebooks must first be converted from .ipynb to Zeppelin format on the
command line.
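What such a conversion has to do can be illustrated with a rough sketch. It assumes a heavily simplified Zeppelin note layout (a `name` plus a list of `paragraphs`, each with a `text` field that starts with an interpreter directive such as `%md` or `%python`). Real Zeppelin notes carry more fields, so a dedicated converter is preferable in practice; the function name is hypothetical.

```python
import json


def ipynb_to_zeppelin_note(ipynb: dict, name: str) -> dict:
    """Map Jupyter cells to Zeppelin-style paragraphs (simplified sketch).

    Assumption: a note is {"name": ..., "paragraphs": [{"text": ...}]}.
    Outputs, ids and other metadata are intentionally dropped.
    """
    paragraphs = []
    for cell in ipynb.get("cells", []):
        source = cell.get("source", "")
        # .ipynb stores source either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        cell_type = cell.get("cell_type")
        if cell_type == "markdown":
            directive = "%md"
        elif cell_type == "code":
            directive = "%python"  # assumes a Python kernel
        else:
            continue  # skip raw cells and unknown cell types
        paragraphs.append({"text": f"{directive}\n{source}"})
    return {"name": name, "paragraphs": paragraphs}


if __name__ == "__main__":
    sample = {"cells": [{"cell_type": "code", "source": ["print('hello')"]}]}
    print(json.dumps(ipynb_to_zeppelin_note(sample, "demo"), indent=2))
```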
19. #ODSC 2018
Commercial services for notebook sharing
Google Colab
● Share notebooks through Google Drive
● View/Edit/Run .ipynb notebook using Colab
● Realtime collaboration
ZEPL
● Notebook level ACL
● View/Edit/Run .ipynb and Zeppelin notebook
● Realtime collaboration
● Import existing notebooks from Git / S3 storage
www.zepl.com
21. #ODSC 2018
DON’Ts
● Email attachments
● Direct sending
● Sharing through USB drives
● ...
Email attachment
Local copy on a laptop
USB drive
22. #ODSC 2018
DO’s
● Provide access to the same dataset
● Access control capability
● Horizontal scalability
23. #ODSC 2018
Data catalog
● Provides the location of data, what it means, and how to load it
○ e.g.
● The catalog needs to be accessible / searchable / annotatable
● Many different ways to build one, depending on team / infra
○ Hive Metastore as a data catalog
○ Cloud infrastructure service (e.g. AWS Glue Data Catalog, Azure Data Catalog)
○ Data catalog / publishing software (e.g. CKAN, DKAN)
○ Custom built on top of RDBMS, NoSQL, indexing engine
○ Build a data catalog using notebooks
Dataset  | Location              | Schema                                       | Note
Activity | s3://service/activity | Date (DateTime), type (INT), action (String) | Type is either RUN or STOP. ….
Images   | s3://service/images   | 512x256 pixel images                         | Images are collected from profile photo...
24. #ODSC 2018
Build data catalog using Notebook
● Flexible enough to describe data
● Searchable, shareable, annotatable
● Programmatic generation
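Programmatic generation could look like the sketch below: catalog entries live in a notebook cell as plain data, and the table and a keyword search are derived from them. The `Dataset` class and the `render_catalog` / `search` helpers are hypothetical names; the two entries reuse the Activity and Images examples from the data catalog slide, with their truncated notes left as they are.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    location: str
    schema: str
    note: str


# Entries taken from the example catalog table; notes are left truncated.
CATALOG = [
    Dataset("Activity", "s3://service/activity",
            "Date (DateTime), type (INT), action (String)",
            "Type is either RUN or STOP. ..."),
    Dataset("Images", "s3://service/images",
            "512x256 pixel images",
            "Images are collected from profile photo..."),
]


def render_catalog(entries):
    """Render the catalog as an aligned plain-text table."""
    header = ("Dataset", "Location", "Schema", "Note")
    rows = [header] + [(d.name, d.location, d.schema, d.note) for d in entries]
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    lines = [" | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
             for row in rows]
    lines.insert(1, "-+-".join("-" * w for w in widths))  # header separator
    return "\n".join(lines)


def search(entries, keyword):
    """Case-insensitive keyword search across all fields of each entry."""
    needle = keyword.lower()
    return [d for d in entries
            if needle in " ".join((d.name, d.location, d.schema, d.note)).lower()]


if __name__ == "__main__":
    print(render_catalog(CATALOG))
    print([d.name for d in search(CATALOG, "stop")])
```

Because the catalog is ordinary notebook content, it can be shared, searched, and annotated with the same tools as any other notebook.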
27. #ODSC 2018
Sign in and Run
Install libraries and
Install notebook and
Configure driver, environments and
Request access to data and
Set up access to notebook repo and
….
Run
29. #ODSC 2018
Single-user notebook servers behind a reverse proxy
● Easier to implement / manage
● Notebook sharing is decoupled from the
execution environment
● e.g.
○ JupyterHub
○ AWS SageMaker
Diagram components: Browser, Reverse Proxy, Single-user Notebook server + Kernel (x2), Notebook Storage
Multi-user notebook server
● More complex to implement / manage
● Notebook sharing is coupled with the execution
environment, so a more integrated
sharing environment can be expected.
● e.g.
○ Apache Zeppelin
○ ZEPL
○ Google Colab
Diagram components: Browser, Multi-user Notebook server, Kernel (x3), Notebook Storage
30. #ODSC 2018
Reproducibility on notebook
1. Configure environment
a. %env, %python.config, %spark.config
2. Install libraries
a. !pip install, %spark.dep
3. Load data
4. Your work
5. Print libraries
a. !pip list, %conda list
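For step 5, a library listing can also be produced from inside the notebook itself. The sketch below uses only the Python standard library (`importlib.metadata`, Python 3.8+) and is an alternative to `!pip list`, not a replacement for it; the `installed_libraries` helper name is hypothetical.

```python
import sys
from importlib import metadata


def installed_libraries():
    """Return sorted 'name==version' strings for installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Record the interpreter version alongside the library versions.
    print(f"python {sys.version.split()[0]}")
    for pin in installed_libraries():
        print(pin)
```

Running this as the last cell leaves a record of the environment in the saved notebook output.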
31. #ODSC 2018
Notebook to production
Built-in scheduler
● Zeppelin
● ZEPL
External scheduler
● REST API
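An external scheduler (cron, a workflow engine, and so on) can trigger a notebook run over REST. The sketch below only builds the request and does not send it. It assumes Zeppelin's notebook REST API exposes "run all paragraphs" as `POST /api/notebook/job/<noteId>`; the server address and note id are hypothetical, authentication is omitted, and the endpoint should be checked against the documentation for the Zeppelin version in use.

```python
from urllib.request import Request


def build_run_note_request(base_url: str, note_id: str) -> Request:
    """Build (but do not send) a request that asks Zeppelin to run a note.

    Assumption: POST <base_url>/api/notebook/job/<note_id> runs all
    paragraphs of the note. No authentication headers are added here.
    """
    url = f"{base_url.rstrip('/')}/api/notebook/job/{note_id}"
    return Request(url, data=b"", method="POST")


if __name__ == "__main__":
    # Hypothetical server address and note id.
    req = build_run_note_request("http://localhost:8080", "2ABCDEFGH")
    print(req.get_method(), req.full_url)
    # An external scheduler would send it with urllib.request.urlopen(req).
```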
32. #ODSC 2018
Notebook to production
Rewrite :) and submit
In C/C++, Python, Scala ...
Export, submit notebook as an application
- Run notebook in command line
- Export notebook as a Spark application
- https://github.com/CODAIT/notebook-exporter/tree/master/notebook-exporter
Data pipeline
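For Jupyter notebooks, "run notebook in command line" can be sketched with nbconvert's `--execute` option, wrapped in Python so that a pipeline step can call it. The file names are hypothetical, and the sketch assumes `jupyter` and `nbconvert` are installed on the machine that runs the pipeline.

```python
import subprocess


def nbconvert_execute_command(notebook: str, output: str) -> list:
    """Command line that executes a notebook and saves the executed copy."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",   # keep the result in .ipynb format
        "--execute",          # run all cells before writing the output
        "--output", output,
        notebook,
    ]


def run_notebook(notebook: str, output: str) -> None:
    """Run the notebook; raises CalledProcessError if execution fails."""
    subprocess.run(nbconvert_execute_command(notebook, output), check=True)


if __name__ == "__main__":
    # Hypothetical file names; only the command is printed, nothing is run.
    print(" ".join(nbconvert_execute_command("report.ipynb", "report-run.ipynb")))
```

A non-zero exit code from a failing cell surfaces as an exception, which lets the surrounding data pipeline mark the step as failed.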
33. #ODSC 2018
Conclusion
● Share notebooks
● Share data
● Multi-user environment
Together, these enable collaboration.
Things to consider
● Reproducibility
● Notebook to production