ODSC East 2017 - How to use Zeppelin and Spark to document your research.
Reproducible research documents not just the findings of a study but the exact code required to produce them. It allows study authors to reliably repeat their analysis and to accelerate new findings by applying the same techniques to new data. The increased transparency lets peers quickly understand a study's methods and compare them to other studies, which can lead to higher levels of trust, greater interest, and eventually more citations of your work. Big data introduces new challenges for reproducible research. As our data universe expands and the open data movement grows, more data is available to analyze than ever before, and the possible combinations are endless. Data cleaning and feature extraction often involve lengthy sequences of transformations, and the space allotted in publications is not adequate to describe all the details well enough for others to review and reproduce them. Fortunately, the open source community is addressing this need with Apache Spark, Zeppelin, and Hadoop. Apache Spark 2.0 makes it even simpler and faster to harness the power of a Hadoop computing cluster to clean, analyze, explore, and train machine learning models on large data sets. Zeppelin web-based notebooks capture code and interactive visualizations so they can be shared with others. After this session you will be able to create a reproducible data science pipeline over large data sets using Spark, Zeppelin, and a Hadoop distributed computing cluster. You will learn how to combine Spark with other supported interpreters to codify your results from cleaning to exploration to feature extraction and machine learning, and discover how to share your notebooks and data with others using the cloud. This talk will cover Spark and show examples, but it is not intended to be a complete tutorial on Spark.
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
1. REPRODUCIBLE RESEARCH AT SCALE WITH APACHE SPARK AND ZEPPELIN NOTEBOOK
CAROLYN DUBY
SOLUTIONS ENGINEER, NORTHEAST
HORTONWORKS
@ODSC
OPEN
DATA
SCIENCE
CONFERENCE
Boston | May 3-5th
2. ABOUT CAROLYN DUBY
• Big Data Solutions Architect
• High performance data intensive systems
• Data science
• ScB ScM Computer Science, Brown University
• LinkedIn: https://www.linkedin.com/in/carolynduby/
• Twitter: @carolynduby Github: carolynduby
• Hortonworks
• Innovation through data
• Enterprise ready, 100% open source, modern data platforms
• Engineering, Technical Support, Professional Services, Training
4. AGENDA
• What is Reproducible Research? Why do it?
• What does it take to do Reproducible Research at scale?
• Example with Apache Zeppelin and Spark
5. REPRODUCIBLE RESEARCH
• Complete details of the data analysis methods that yield the conclusions
• Replication of the research on independently collected data is the gold standard
6. BENEFITS
• Individual productivity
• Effective peer review
• Answer questions more quickly
• Correct errors
• Apply methods to other experiments
• Increased quality and respect for results
• Justify business decisions
7. CHALLENGES
• Large data sets
• Complex analysis
• Data lineage
• Streaming data
• Limited space in publications
8. HOW TO DO REPRODUCIBLE RESEARCH
• Define Platform
• Record all versions of analysis software and installation procedures
• Analyze Data
• Record all commands to acquire, clean, organize, analyze
• Store intermediate results
• Version control
• Share Methods and Results
• Publish
• Share full details
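Recording platform versions can itself be a paragraph in the notebook. A minimal sketch using Zeppelin's default %spark (Scala) interpreter, which provides the `sc` SparkContext:

```scala
%spark
// Record the exact versions used for this analysis so readers can
// reconstruct the platform later.
println(s"Spark version: ${sc.version}")
println(s"Scala version: ${util.Properties.versionString}")
println(s"Java version:  ${System.getProperty("java.version")}")
```

Running this as the first paragraph of a note keeps the platform details versioned alongside the analysis itself.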
9. REPRODUCIBLE RESEARCH WITH APACHE OPEN SOURCE
• Apache Spark version 2.1
• Cleaning and analysis of large data sets
• http://spark.apache.org
• Apache Zeppelin Notebook 0.7.0
• Capture commands so the analysis can be re-run automatically
• Visualize data for exploration and results
• https://zeppelin.apache.org
10. APACHE SPARK
• Distributed processing efficiently crunches large data sets
• Optimized
• Horizontally scalable with multi-tenancy
• Fault tolerant
• One platform for streaming, cleaning, analyzing
• Elegant APIs – Scala, Python, Java, R
• Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.
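The connector point can be sketched in a line or two: the same DataFrameReader API reads from a local file system, HDFS, or S3 simply by changing the URI scheme. The paths below are placeholders, not from the talk:

```scala
%spark
// Same reader API regardless of where the data lives; only the URI changes.
// All paths here are hypothetical examples.
val local = spark.read.option("header", "true").csv("file:///tmp/crimes.csv")
val hdfs  = spark.read.option("header", "true").csv("hdfs:///data/crimes.csv")
val s3    = spark.read.option("header", "true").csv("s3a://my-bucket/crimes.csv")
```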
11. SPARK LIBRARIES
• Same API for all data sources
• SQL - http://spark.apache.org/sql/
• Access structured data and combine with other sources
• MLlib - http://spark.apache.org/mllib/
• Machine learning for training models and making predictions
• GraphX - http://spark.apache.org/graphx/
• Connectivity algorithms
• Streaming - http://spark.apache.org/streaming/
• Complex event processing and data ingest
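As a sketch of the Spark SQL library: a DataFrame registered as a temporary view can be queried with plain SQL and combined with other sources. The path is hypothetical; the `Primary Type` column follows the Chicago crimes CSV referenced later in the talk:

```scala
%spark
// Register a DataFrame as a temporary view, then query it with SQL.
val crimes = spark.read.option("header", "true").csv("hdfs:///data/crimes.csv")
crimes.createOrReplaceTempView("crimes")

// Count crimes by type; backticks quote the column name containing a space.
val byType = spark.sql(
  "SELECT `Primary Type` AS crime_type, COUNT(*) AS n " +
  "FROM crimes GROUP BY `Primary Type` ORDER BY n DESC")
byType.show(10)
```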
12. ZEPPELIN
• Notebook
• Combine Markdown, shell, Spark, and SQL commands in the same notebook
• Easily integrate with Spark in different languages
• Visualize data using graphs and pivot charts
• Share notebooks or paragraphs
15. GETTING STARTED
• Use a distribution
• Curated set of compatible open source projects
• Sandbox - single-node cluster in a VM or on Azure
• https://hortonworks.com/products/sandbox/
• Hortonworks Community Connection
• http://community.hortonworks.com
• On-premises
• Use Apache Ambari to manage on-premises physical hardware
• Cloud
• Automated provisioning with Cloudbreak (https://github.com/sequenceiq/cloudbreak)
• AWS, Azure, Google Cloud
16. ZEPPELIN BASICS
• Notes are composed of paragraphs
• Paragraph contains code or markdown
• Specify the interpreter with %<interpreter name>, or leave blank for the default
• Enter commands
• Click play button to run code on cluster
• Results display in paragraph
• Code and results can be shown or hidden
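A note's paragraphs might look like the following hypothetical fragment; each `%` directive picks the interpreter for that paragraph, and the code is Scala under the default %spark interpreter:

```
%md
## Data cleaning
The paragraph below loads the raw crime data. (Path is illustrative.)

%spark
val raw = spark.read.option("header", "true").csv("hdfs:///data/crimes.csv")
raw.printSchema()
```

Clicking the play button on each paragraph runs it on the cluster and renders the result inline.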
19. EXAMPLE
• Crimes in Chicago Kaggle dataset
• Interesting opportunities for time series analysis and prediction
• https://www.kaggle.com/currie32/crimes-in-chicago
21. OPTIMIZING DATA CLEANING
• Keep a raw copy
• Web sites go away, remove data, change links and interfaces
• Store the clean data
• Saves time on each subsequent analysis
• Use a standard format (Optimized Row Columnar (ORC), Parquet, etc.)
• Query the data with Hive
• Shared location if security and privacy requirements allow
• Collaborate by sharing data with others
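A hedged sketch of the clean-once, store, re-read pattern described above; paths, column names, and cleaning steps are illustrative, not the talk's actual pipeline:

```scala
%spark
// Clean the raw copy once and persist the result in a columnar format.
val raw = spark.read.option("header", "true").csv("hdfs:///raw/crimes.csv")
val clean = raw
  .na.drop(Seq("Date", "Primary Type")) // drop rows missing key fields
  .dropDuplicates("ID")                 // remove duplicate records
clean.write.mode("overwrite").orc("hdfs:///clean/crimes.orc")

// Later notebooks start from the stored clean copy instead of re-cleaning.
val crimes = spark.read.orc("hdfs:///clean/crimes.orc")
```

Storing the cleaned data in a shared location (where security and privacy allow) lets collaborators start from the same intermediate result.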
55. TIPS AND TRICKS
• Use val for variables used across paragraphs
• vars can yield unpredictable results when paragraphs are run out of order
• Break up big notebooks
• Store intermediate results
• Avoid reloading and recalculating the same values
• Verify your notebook by running all paragraphs
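The val-versus-var tip can be illustrated in a short hypothetical sketch: a val is bound once, so re-running paragraphs out of order cannot silently change it, while a var mutated across paragraphs depends on execution order:

```scala
%spark
// Safe: a val bound once reads the same in every paragraph,
// regardless of the order in which paragraphs are run.
val cleanPath = "hdfs:///clean/crimes.orc" // illustrative path

// Risky: if the paragraph that updates `total` is re-run,
// the value depends on how many times it executed.
var total = 0L
total += 1
```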
56. SHARING NOTEBOOKS
• Share link to notebook or paragraph
• Readers access your Zeppelin server
• Use logins and permissions
• Export to JSON and save to shared file
• Readers get JSON from shared file (github, cloud, etc)
• Import to their Zeppelin server
• Sync your notebooks to Zeppelin Hub (https://www.zeppelinhub.com)
• Share Zeppelin Hub link with readers
• Free version for small teams
61. VIEWING VERSION HISTORY AND PREVIOUS VERSIONS
• Pull down the list of versions
• Select a version; Zeppelin shows the content for that version
• Head goes to the latest version
62. GIT REPO ON ZEPPELIN SERVER
• Zeppelin creates a git repo in the notebook directory
65. REPRODUCIBLE RESEARCH
• Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol 9(10): e1003285. doi:10.1371/journal.pcbi.1003285
• http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003285&type=printable
66. ZEPPELIN AND SPARK
• Spark
• https://dzone.com/articles/try-the-latest-innovations-in-apache-spark-and-apa
• https://hortonworks.com/hadoop-tutorial/learning-spark-zeppelin/
• https://spark.apache.org/docs/2.1.0/ml-pipeline.html
• Example Notebooks
• https://github.com/hortonworks-gallery/zeppelin-notebooks
68. EXAMPLE
• Chicago Crimes Data Set
• https://www.kaggle.com/currie32/crimes-in-chicago
• Example notebooks
• https://github.com/carolynduby/ODSC2017