ODSC East 2017 - How to use Zeppelin and Spark to document your research.
Reproducible research documents not just the findings of a study but the exact code required to produce them. It allows study authors to reliably repeat their analysis and to accelerate new findings by applying the same techniques to new data. The increased transparency lets peers quickly understand a study's methods and compare them to other studies, which can lead to higher levels of trust, greater interest, and eventually more citations of your work. Big data introduces new challenges for reproducible research. As our data universe expands and the open data movement grows, more data is available to analyze than ever before, and the possible combinations are endless. Data cleaning and feature extraction often involve lengthy sequences of transformations, and the space allotted in publications is not adequate to describe all the details well enough for others to review and reproduce them. Fortunately, the open source community is addressing this need with Apache Spark, Zeppelin, and Hadoop. Apache Spark 2.0 makes it even simpler and faster to harness the power of a Hadoop computing cluster to clean, analyze, explore, and train machine learning models on large data sets. Zeppelin web-based notebooks capture code and interactive visualizations so they can be shared with others. After this session you will be able to create a reproducible data science pipeline over large data sets using Spark, Zeppelin, and a Hadoop distributed computing cluster. You will learn how to combine Spark with other supported interpreters to codify your results from cleaning to exploration to feature extraction and machine learning, and discover how to share your notebooks and data with others using the cloud. This talk will cover Spark and show examples, but it is not intended to be a complete tutorial on Spark.
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
1. REPRODUCIBLE RESEARCH AT SCALE WITH APACHE SPARK AND ZEPPELIN NOTEBOOK
CAROLYN DUBY
SOLUTIONS ENGINEER, NORTHEAST
HORTONWORKS
@ODSC
OPEN
DATA
SCIENCE
CONFERENCE
Boston | May 3-5th
2. ABOUT CAROLYN DUBY
• Big Data Solutions Architect
• High performance data intensive systems
• Data science
• ScB ScM Computer Science, Brown University
• LinkedIn: https://www.linkedin.com/in/carolynduby/
• Twitter: @carolynduby Github: carolynduby
• Hortonworks
• Innovation through data
• Enterprise ready, 100% open source, modern data platforms
• Engineering, Technical Support, Professional Services, Training
4. AGENDA
• What is Reproducible Research? Why do it?
• What does it take to do Reproducible Research at scale?
• Example with Apache Zeppelin and Spark
5. REPRODUCIBLE RESEARCH
• Complete details of the data analysis methods that yield the conclusions
• Replication of the research on independently collected data is the gold standard
6. BENEFITS
• Individual productivity
• Effective peer review
• Answer questions more quickly
• Correct errors
• Apply methods to other experiments
• Increased quality and respect for results
• Justify business decisions
7. CHALLENGES
• Large data sets
• Complex analysis
• Data lineage
• Streaming data
• Limited space in publications
8. HOW TO DO REPRODUCIBLE RESEARCH
• Define Platform
• Record all versions of analysis software and installation procedures
• Analyze Data
• Record all commands to acquire, clean, organize, analyze
• Store intermediate results
• Version control
• Share Methods and Results
• Publish
• Share full details
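Recording platform versions can itself be a paragraph in the notebook. A minimal sketch using Zeppelin's default %spark (Scala) interpreter, which provides the `sc` SparkContext:

```scala
%spark
// Record the exact versions used for this analysis so readers can
// reconstruct the platform later.
println(s"Spark version: ${sc.version}")
println(s"Scala version: ${util.Properties.versionString}")
println(s"Java version:  ${System.getProperty("java.version")}")
```

Running this as the first paragraph of a note keeps the platform details versioned alongside the analysis itself.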
9. REPRODUCIBLE RESEARCH WITH APACHE OPEN SOURCE
• Apache Spark version 2.1
• Cleaning and analysis of large data sets
• http://spark.apache.org
• Apache Zeppelin Notebook 0.7.0
• Capture commands so the analysis can be re-run automatically
• Visualize data for exploration and results
• https://zeppelin.apache.org
10. APACHE SPARK
• Distributed processing efficiently crunches large data sets
• Optimized
• Horizontally scalable with multi-tenancy
• Fault tolerant
• One platform for streaming, cleaning, analyzing
• Elegant APIs – Scala, Python, Java, R
• Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.
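The connector point can be sketched in a line or two: the same DataFrameReader API reads from a local file system, HDFS, or S3 simply by changing the URI scheme. The paths below are placeholders, not from the talk:

```scala
%spark
// Same reader API regardless of where the data lives; only the URI changes.
// All paths here are hypothetical examples.
val local = spark.read.option("header", "true").csv("file:///tmp/crimes.csv")
val hdfs  = spark.read.option("header", "true").csv("hdfs:///data/crimes.csv")
val s3    = spark.read.option("header", "true").csv("s3a://my-bucket/crimes.csv")
```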
11. SPARK LIBRARIES
• Same API for all data sources
• SQL - http://spark.apache.org/sql/
• Access structured data and combine with other sources
• MLlib - http://spark.apache.org/mllib/
• Machine learning for training models and making predictions
• GraphX - http://spark.apache.org/graphx/
• Connectivity algorithms
• Streaming - http://spark.apache.org/streaming/
• Complex event processing and data ingest
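As a sketch of the Spark SQL library: a DataFrame registered as a temporary view can be queried with plain SQL and combined with other sources. The path is hypothetical; the `Primary Type` column follows the Chicago crimes CSV referenced later in the talk:

```scala
%spark
// Register a DataFrame as a temporary view, then query it with SQL.
val crimes = spark.read.option("header", "true").csv("hdfs:///data/crimes.csv")
crimes.createOrReplaceTempView("crimes")

// Count crimes by type; backticks quote the column name containing a space.
val byType = spark.sql(
  "SELECT `Primary Type` AS crime_type, COUNT(*) AS n " +
  "FROM crimes GROUP BY `Primary Type` ORDER BY n DESC")
byType.show(10)
```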
12. ZEPPELIN
• Notebook
• Combine Markdown, shell, Spark, and SQL commands in the same notebook
• Easily integrate with Spark in different languages
• Visualize data using graphs and pivot charts
• Share notebooks or paragraphs
15. GETTING STARTED
• Use a distribution
• Curated set of compatible open source projects
• Sandbox - single-node cluster in a VM or on Azure
• https://hortonworks.com/products/sandbox/
• Hortonworks Community Connection
• http://community.hortonworks.com
• On-premises
• Use Apache Ambari to manage on-premises physical hardware
• Cloud
• Automated provisioning with Cloudbreak (https://github.com/sequenceiq/cloudbreak)
• AWS, Azure, Google Cloud
16. ZEPPELIN BASICS
• Notes are composed of paragraphs
• Paragraph contains code or markdown
• Specify the interpreter with %<interpreter name>, or leave blank for the default
• Enter commands
• Click play button to run code on cluster
• Results display in paragraph
• Code and results can be shown or hidden
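A note's paragraphs might look like the following hypothetical fragment; each `%` directive picks the interpreter for that paragraph, and the code is Scala under the default %spark interpreter:

```
%md
## Data cleaning
The paragraph below loads the raw crime data. (Path is illustrative.)

%spark
val raw = spark.read.option("header", "true").csv("hdfs:///data/crimes.csv")
raw.printSchema()
```

Clicking the play button on each paragraph runs it on the cluster and renders the result inline.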
19. EXAMPLE
• Crimes in Chicago Kaggle dataset
• Interesting opportunities for time series analysis and prediction
• https://www.kaggle.com/currie32/crimes-in-chicago
21. OPTIMIZING DATA CLEANING
• Keep a raw copy
• Web sites go away, remove data, change links and interfaces
• Store the clean data
• Saves time on each subsequent analysis
• Use a standard format (Optimized Row Columnar (ORC), Parquet, etc.)
• Query the data with Hive
• Shared location if security and privacy requirements allow
• Collaborate by sharing data with others
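A hedged sketch of the clean-once, store, re-read pattern described above; paths, column names, and cleaning steps are illustrative, not the talk's actual pipeline:

```scala
%spark
// Clean the raw copy once and persist the result in a columnar format.
val raw = spark.read.option("header", "true").csv("hdfs:///raw/crimes.csv")
val clean = raw
  .na.drop(Seq("Date", "Primary Type")) // drop rows missing key fields
  .dropDuplicates("ID")                 // remove duplicate records
clean.write.mode("overwrite").orc("hdfs:///clean/crimes.orc")

// Later notebooks start from the stored clean copy instead of re-cleaning.
val crimes = spark.read.orc("hdfs:///clean/crimes.orc")
```

Storing the cleaned data in a shared location (where security and privacy allow) lets collaborators start from the same intermediate result.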
55. TIPS AND TRICKS
• Use val for variables used across paragraphs
• vars can yield unpredictable results when paragraphs are run out of order
• Break up big notebooks
• Store intermediate results
• Avoid reloading and recalculating the same values
• Verify your notebook by running all paragraphs
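The val-versus-var tip can be illustrated in a short hypothetical sketch: a val is bound once, so re-running paragraphs out of order cannot silently change it, while a var mutated across paragraphs depends on execution order:

```scala
%spark
// Safe: a val bound once reads the same in every paragraph,
// regardless of the order in which paragraphs are run.
val cleanPath = "hdfs:///clean/crimes.orc" // illustrative path

// Risky: if the paragraph that updates `total` is re-run,
// the value depends on how many times it executed.
var total = 0L
total += 1
```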
56. SHARING NOTEBOOKS
• Share link to notebook or paragraph
• Readers access your Zeppelin server
• Use logins and permissions
• Export to JSON and save to shared file
• Readers get JSON from shared file (github, cloud, etc)
• Import to their Zeppelin server
• Sync your notebooks to Zeppelin Hub (https://www.zeppelinhub.com)
• Share Zeppelin Hub link with readers
• Free version for small teams
61. VIEWING VERSION HISTORY AND PREVIOUS VERSIONS
• Pull down the list of versions
• Select a version; Zeppelin shows the content for that version
• Head goes to the latest version
62. GIT REPO ON ZEPPELIN SERVER
• Zeppelin creates a git repo in the notebook directory
65. REPRODUCIBLE RESEARCH
• Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol 9(10): e1003285. doi:10.1371/journal.pcbi.1003285
• http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003285&type=printable
66. ZEPPELIN AND SPARK
• Spark
• https://dzone.com/articles/try-the-latest-innovations-in-apache-spark-and-apa
• https://hortonworks.com/hadoop-tutorial/learning-spark-zeppelin/
• https://spark.apache.org/docs/2.1.0/ml-pipeline.html
• Example Notebooks
• https://github.com/hortonworks-gallery/zeppelin-notebooks
68. EXAMPLE
• Chicago Crimes Data Set
• https://www.kaggle.com/currie32/crimes-in-chicago
• Example notebooks
• https://github.com/carolynduby/ODSC2017