SPARKLY NOTEBOOK: INTERACTIVE
ANALYSIS AND VISUALIZATION WITH SPARK
FELIX CHEUNG
APRIL 2015
HTTP://WWW.MEETUP.COM/SEATTLE-SPARK-MEETUP/EVENTS/208711962/
SETUP
• Spark on CDH cluster
• Vagrant - 2-nodes - custom provisioning
AGENDA
• IPython + PySpark cluster
• Zeppelin
• Spark’s Streaming k-means
• Lightning
SPARK - 10 SEC INTRODUCTION
• Spark
• Spark SQL + Data Frame + data source
• Spark Streaming
• MLlib
• GraphX
It’s a lot of time looking at data..
REPL
• Read-Eval-Print-Loop
Set of REPL related to Spark…
$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
15/04/15 11:31:28 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> a.collect.foreach(x => println(x))
1
2
3
4
GOOD
• See results instantly
NOT SO GOOD
• Ok as an IDE
• No Save / Repeat
• No visualization
NOTEBOOK
Jupyter
IPython will continue to exist as a Python kernel for Jupyter, but
the notebook and other language-agnostic parts of IPython will
move to new projects under the Jupyter name. IPython 3.0 will
be the last monolithic release of IPython.
“IPython” http://ipython.org/
• interactive shell
• browser-based notebook
• 'Kernel'
• great support for visualization libraries (eg. matplotlib)
• built on pyzmq, tornado
IPYTHON/JUPYTER
IPYTHON NOTEBOOK

NOTEBOOK == BROWSER-BASED REPL
IPython Notebook is a web-based interactive
computational environment for creating IPython
notebooks. An IPython notebook is a JSON
document containing an ordered list of input/output
cells which can contain code, text, mathematics,
plots and rich media.
MATPLOTLIB
matplotlib tries to make easy things easy and hard things
possible. You can generate plots, histograms, power
spectra, bar charts, errorcharts, scatterplots, etc, with just a
few lines of code, with familiar MATLAB APIs.
plt.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
plt.yticks(y_pos, people)
plt.xlabel('Performance')
plt.title('How fast do you want to go today?')
plt.show()
PYSPARK
• Spark on Python; this serves as the Kernel,
integrating with IPython
• Each notebook spins up a new instance of the
Kernel (ie. PySpark running as the Spark Driver, in
whichever deploy mode Spark/PySpark supports)
(All notebook examples are a subset of those in
the Meetup reconstructed here)
Markdown
Spark in
Python
Source: http://nbviewer.ipython.org/github/ResearchComputing/scientific_computing_tutorials/blob/master/spark/02_word_count.ipynb
WORD2VEC EXAMPLE
Word2Vec computes distributed vector
representation of words. Distributed vector
representation has been shown to be useful in many
natural language processing applications such as
named entity recognition, disambiguation, parsing,
tagging and machine translation.

https://code.google.com/p/word2vec/
Spark MLlib implements the Skip-gram approach.
With Skip-gram we want to predict a window of
words given a single word.
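To make the Skip-gram idea concrete, here is a minimal sketch in plain Python of how (center word, context word) training pairs can be drawn from a sentence. It is not Spark MLlib's implementation; the function name `skipgram_pairs` and the `window` parameter are hypothetical names chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours.

    Hypothetical helper for illustration: `window` is the number of words
    considered on each side of the center word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    sentence = "spark makes big data simple".split()
    for center, context in skipgram_pairs(sentence, window=1):
        print(center, "->", context)
```

A Skip-gram model is then trained so that, given the center word, the context words in these pairs receive high probability.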
WORD2VEC DATASET
Wikipedia dump http://mattmahoney.net/dc/textdata
grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
then randomly sampled to ~200k lines
MORE VISUALIZATIONS
• matplotlib: http://matplotlib.org
• Seaborn: http://stanford.edu/~mwaskom/software/seaborn/
• Bokeh: http://bokeh.pydata.org/en/latest/
SETUP
To setup IPython
• Python 2.7.9 (separate from CentOS default 2.6.6), on all
nodes
• matplotlib, on the host running IPython
To run IPython with the PySpark Kernel, set these in the environment

(Please check out my handy script on github)
• PYSPARK_PYTHON: command to run python, eg. “python2.7”
• PYSPARK_DRIVER_PYTHON: command to run ipython
• PYSPARK_DRIVER_PYTHON_OPTS: “notebook --profile”
• PYSPARK_SUBMIT_ARGS: pyspark commandline, eg. --master --deploy-mode
• YARN_CONF_DIR: if YARN mode
• LD_LIBRARY_PATH: for matplotlib
IPYTHON/JUPYTER KERNELS
• IPython
• IGo
• Bash
• IR
• IHaskell
• IMatlab
• ICSharp
• IScala
• IRuby
• IJulia
.. and more https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
ZEPPELIN
Apache Zeppelin (incubating) is an interactive data analytics environment
for distributed data processing systems. It provides a beautiful interactive
web-based interface, data visualization, a collaborative work
environment and many other nice features to make your data analytics
more fun and enjoyable.
Zeppelin has been incubating since Dec 2014.

https://zeppelin.incubator.apache.org/
• Shell script & calling library package
• Load and process data with Spark
• SQL query powered by Spark SQL - progress & parameterization via dynamic form
• Python & data passing across languages (interpreters)
ZEPPELIN ARCHITECTURE
• Realtime collaboration - enabled by websocket communications
• Frontend: AngularJS
• Backend server: Java
• Interpreters: Java
• Visualization: NVD3
INTERPRETERS
• Spark group
• Spark (Scala)
• PySpark
• Spark SQL
• Dependency
• Markdownjs
• Shell
• Hive
• Coming: jdbc, Tajo, etc.
CLUSTERING
• Clustering tries to find natural groupings in
data. It puts objects into groups in which
those within a group are more similar to each
other than to those in other groups.
• Unsupervised learning
K-MEANS
• First, given an initial set of k cluster centers,
we nd which cluster each data point is
closest to
• Then, we compute the average of each of the
new clusters and use the result to update our
cluster centers
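The two steps above can be sketched in plain Python as follows. This is a minimal, single-machine illustration and not the MLlib implementation; the function names (`assign`, `update`, `kmeans`) and the choice to leave an empty cluster's center unchanged are assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Step 1: index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]


def update(points, labels, centers):
    """Step 2: move each center to the average of the points assigned to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centers.append(old)  # assumption: keep an empty cluster's center
            continue
        new_centers.append(tuple(sum(p[d] for p in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


def kmeans(points, centers, max_iterations=20):
    """Alternate the two steps until the centers stop moving."""
    for _ in range(max_iterations):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 0.0)]))
```

The result depends on the initial centers, which is why initialization schemes matter in practice.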
K-MEANS|| IN MLLIB
• A parallelized variant of k-means++

http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Parameters:
• k is the number of desired clusters.
• maxIterations is the maximum number of iterations to run.
• initializationMode specifies either random initialization or initialization via
k-means||.
• runs is the number of times to run the k-means algorithm (k-means is not
guaranteed to nd a globally optimal solution, and when run multiple
times on a given dataset, the algorithm returns the best clustering result).
• initializationSteps determines the number of steps in the k-means||
algorithm.
• epsilon determines the distance threshold within which we consider k-means to have converged.
CASE STUDY: K-MEANS - ZEPPELIN
Details on github at: http://bit.ly/1JWOPh8
ANOMALY DETECTION WITH K-MEANS
Using Spark DataFrame, csv data source, to process KDDCup’99 data

Scoring with different k values
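The slides do not show how the scoring is done. One common approach, assumed here, is to score each point by its distance to the nearest cluster center, compare the mean score across several values of k, and flag points whose score exceeds a threshold. The sketch below is plain Python rather than Spark, and every function name and the threshold are hypothetical.

```python
import math


def nearest_distance(point, centers):
    """Euclidean distance from a point to its closest cluster center."""
    return min(
        math.sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))
        for center in centers
    )


def mean_distance(points, centers):
    """Average distance to the nearest center; lower suggests tighter clusters."""
    return sum(nearest_distance(p, centers) for p in points) / len(points)


def anomalies(points, centers, threshold):
    """Points lying farther than `threshold` from every center."""
    return [p for p in points if nearest_distance(p, centers) > threshold]


if __name__ == "__main__":
    centers = [(0.0, 0.0)]  # assumed output of a previously trained model
    data = [(3.0, 4.0), (0.0, 1.0)]
    print(mean_distance(data, centers))
    print(anomalies(data, centers, threshold=2.0))
```

Running `mean_distance` on models trained with different k gives one number per k to compare; how the threshold is picked (for example, from a high percentile of the scores) is left open here.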
COMING SOON (NOW!)
Realtime updates
Dashboard
OTHER NOTEBOOKS
• Spark-notebook: https://github.com/andypetrella/spark-notebook
• ISpark: https://github.com/tribbloid/ISpark
• Spark Kernel: https://github.com/ibm-et/spark-kernel
• Jove Notebook: https://github.com/jove-sh/jove-notebook
• Beaker: https://github.com/twosigma/beaker-notebook
• Databricks Cloud notebook
PART 2
STREAMING K-MEANS
WHY STREAMING?
• Train - model - predict works well on static
data
• What if data is
• Coming in streams
• Changing over time?
STREAMING K-MEANS DESIGN
• Proposed by Dr Jeremy Freeman
STREAMING K-MEANS
• key concept: forgetfulness
• balances the relative importance of new
data versus past history
• half-life
• time it takes before past data contributes to
only one half of the current model
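A small sketch of how forgetfulness and half-life can fit together, written in plain Python. It assumes a decay factor derived from the half-life and a weighted-average update of one cluster center per batch, which is my reading of the streaming k-means design. The function names are hypothetical, and the exact bookkeeping of the running weight may differ from the Spark implementation.

```python
def decay_from_half_life(half_life):
    """Decay factor a such that a ** half_life == 0.5."""
    return 0.5 ** (1.0 / half_life)


def update_center(center, weight, batch_mean, batch_count, decay):
    """Blend an existing center with the mean of the newly assigned points.

    center      -- current center (tuple of floats)
    weight      -- effective number of points seen so far for this cluster
    batch_mean  -- mean of the points assigned to the cluster in this batch
    batch_count -- number of those points
    decay       -- forgetfulness: 1.0 keeps all history, 0.0 keeps none
    """
    discounted = weight * decay          # past data counts for less
    total = discounted + batch_count
    if total == 0:
        return center, 0.0               # nothing seen yet, nothing to update
    new_center = tuple(
        (c * discounted + x * batch_count) / total
        for c, x in zip(center, batch_mean)
    )
    return new_center, total


if __name__ == "__main__":
    a = decay_from_half_life(1)          # half-life of one batch
    print(update_center((0.0,), 10.0, (4.0,), 5, a))
```

With a decay of 1.0 the update reduces to an ordinary running average over all data; with 0.0 each batch replaces the model entirely.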
STREAMING K-MEANS
• time unit
• batches (which have a fixed duration in
time), or points
• eliminate dying clusters

VISUALIZING STREAMING K-MEANS - LIGHTNING
LIGHTNING
• Lightning - data visualization server

http://lightning-viz.org
• provides API-based access to reproducible, web-
based, interactive visualizations. It includes a core set
of visualization types, but is built for extendability
and customization. Lightning supports modern
libraries like d3.js and three.js, and is designed for
interactivity over large data sets and continuously
updating data streams.
VISUALIZING STREAMING K-MEANS ON IPYTHON + LIGHTNING
RUNNING LIGHTNING
• API: node.js, Python, Scala
• Extension support for custom chart (eg. d3.js)
• Requirements:
• Postgres recommended (SQLite ok)
• node.js (npm, gulp)
The Freeman Lab at Janelia Research Campus uses Lightning to visualize
large-scale neural recordings from zebrafish, in collaboration with the
Ahrens Lab
SPARK STREAMING K-MEANS
DEMO
Environment
• requires: numpy, scipy, scikit-learn
• IPython/Python requires: lightning-python package
Demo consists of 3 parts:

https://github.com/felixcheung/spark-ml-streaming
• Python driver script, data generator
• Scala job - Spark Streaming & Streaming k-means
• IPython notebook to process result, visualize with Lightning

Originally this was part of the Python driver script - it has
been modied for this talk to run within IPython
CHALLENGES
• Package management
• Version/build conflicts!
YOU CAN RUN THIS TOO!
• Notebooks available at http://bit.ly/1JWOPh8
• Everything is heavily scripted and automated

Vagrant cong for local, virtual environment
available at http://bit.ly/1DB3OLw
QUESTIONS?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 

Sparkly Notebook: Interactive Analysis and Visualization with Spark

  • 1. SPARKLY NOTEBOOK: INTERACTIVE ANALYSIS AND VISUALIZATION WITH SPARK FELIX CHEUNG APRIL 2015 HTTP://WWW.MEETUP.COM/SEATTLE-SPARK-MEETUP/EVENTS/208711962/
  • 2. SETUP • Spark on CDH cluster • Vagrant - 2-nodes - custom provisioning
  • 3. AGENDA • IPython + PySpark cluster • Zeppelin • Spark’s Streaming k-means • Lightning
  • 4.
  • 5. SPARK - 10 SEC INTRODUCTION • Spark • Spark SQL + Data Frame + data source • Spark Streaming • MLlib • GraphX
  • 6. It’s a lot of time looking at data..
  • 8. A set of REPLs related to Spark…
  • 9. $ spark-shell
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
            /_/
      Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
      Type in expressions to have them evaluated.
      Type :help for more information.
      15/04/15 11:31:28 INFO SparkILoop: Created spark context..
      Spark context available as sc.
      scala> val a = sc.parallelize(1 to 100)
      a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
      scala> a.collect.foreach(x => println(x))
      1
      2
      3
      4
  • 11. NOT SO GOOD • Ok as an IDE • No Save / Repeat • No visualization
  • 13.
  • 14. IPYTHON/JUPYTER “IPython” http://ipython.org/ • interactive shell • browser-based notebook • 'Kernel' • great support for visualization libraries (e.g. matplotlib) • built on pyzmq, tornado
 Jupyter: IPython will continue to exist as a Python kernel for Jupyter, but the notebook and other language-agnostic parts of IPython will move to new projects under the Jupyter name. IPython 3.0 will be the last monolithic release of IPython.
  • 15. IPYTHON NOTEBOOK
 NOTEBOOK == BROWSER-BASED REPL IPython Notebook is a web-based interactive computational environment for creating IPython notebooks. An IPython notebook is a JSON document containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media.
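To make the "JSON document containing an ordered list of input/output cells" concrete, here is a minimal sketch that builds such a document with only the standard library. It assumes the version 4 notebook format (the one introduced with IPython 3.0); the cell contents are made up for illustration and do not come from the talk.

```python
import json

# A minimal notebook in the nbformat 4 layout (assumed here; older IPython
# releases used a different layout). Cell contents are illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "# Word count\nText cells hold Markdown.",
        },
        {
            "cell_type": "code",
            "execution_count": None,  # the cell has not been executed yet
            "metadata": {},
            "outputs": [],            # filled in by the kernel after a run
            "source": "print(1 + 1)",
        },
    ],
}

# Serialize the notebook; this text is what a .ipynb file contains.
notebook_json = json.dumps(notebook, indent=1)

# The order of the list is the order in which cells appear in the browser.
cell_types = [cell["cell_type"] for cell in json.loads(notebook_json)["cells"]]
```

Because the file is plain JSON, notebooks can be saved, diffed and shared like any other text file, which is what the plain REPL lacks.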
  • 16. MATPLOTLIB matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code, with familiar MATLAB-style APIs.
      import matplotlib.pyplot as plt
      # y_pos, performance, error and people are assumed to be defined earlier
      plt.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
      plt.yticks(y_pos, people)
      plt.xlabel('Performance')
      plt.title('How fast do you want to go today?')
      plt.show()
  • 17. PYSPARK • Spark on Python; this serves as the Kernel, integrating with IPython • Each notebook spins up a new instance of the Kernel (i.e. PySpark running as the Spark driver, in any of the deploy modes Spark/PySpark supports)
  • 18. (All notebook examples are a subset of those in the Meetup reconstructed here)
  • 22.
  • 23. WORD2VEC EXAMPLE Word2Vec computes distributed vector representations of words. Distributed vector representations have been shown to be useful in many natural language processing applications such as named entity recognition, disambiguation, parsing, tagging and machine translation.
 https://code.google.com/p/word2vec/ Spark MLlib implements the Skip-gram approach. With Skip-gram we want to predict a window of words given a single word.
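As a rough illustration of "predict a window of words given a single word", the sketch below lists the (center word, context word) pairs a Skip-gram model would be trained on. It is not MLlib's implementation; the function name, the sample sentence and the window size are all assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbours.

    `window` is the number of words considered on each side of the
    center word (an illustrative default, not a value from the talk).
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Made-up sentence, for illustration only
sentence = "spark makes big data simple".split()
pairs = skipgram_pairs(sentence, window=1)
```

With `window=1` each inner word contributes two pairs and each edge word one, so the model sees every word together with its immediate neighbours.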
  • 24. WORD2VEC DATASET Wikipedia dump http://mattmahoney.net/dc/textdata
      grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
 then randomly sampled to ~200k lines
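The grep step cuts the single long line of text8 into chunks of at most 16 words. A Python equivalent, as a sketch only (the function name and the sample text are assumptions), could look like this:

```python
import re

# Same idea as the grep pattern: one word followed by up to 15 further
# "separator + word" groups, i.e. at most 16 words per chunk.
CHUNK = re.compile(r"\w+(?:\W+\w+){0,15}")

def split_into_lines(text):
    """Return successive non-overlapping chunks, as `grep -o` prints them."""
    return CHUNK.findall(text)

# Tiny stand-in for text8 (made-up words, for illustration only)
sample = " ".join("w%d" % i for i in range(40))
lines = split_into_lines(sample)
word_counts = [len(line.split()) for line in lines]
```

Each resulting line can then be treated as one "sentence" when the data is fed to Word2Vec.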
  • 25.
  • 26.
  • 27. MORE VISUALIZATIONS matplotlib: http://matplotlib.org Seaborn: http://stanford.edu/~mwaskom/software/seaborn/ Bokeh: http://bokeh.pydata.org/en/latest/
  • 28. SETUP To set up IPython • Python 2.7.9 (separate from the CentOS default 2.6.6), on all nodes • matplotlib, on the host running IPython To run IPython with the PySpark Kernel, set these in the environment
 (Please check out my handy script on github)
 PYSPARK_PYTHON: command to run python, e.g. "python2.7"
 PYSPARK_DRIVER_PYTHON: command to run ipython
 PYSPARK_DRIVER_PYTHON_OPTS: "notebook --profile"
 PYSPARK_SUBMIT_ARGS: pyspark command line, e.g. --master --deploy-mode
 YARN_CONF_DIR: if YARN mode
 LD_LIBRARY_PATH: for matplotlib
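The environment described above could be assembled as in the following sketch. This is not the author's script: the helper name, the profile name `pyspark` and the `--master local[2]` value are placeholders chosen for this example, and YARN_CONF_DIR and LD_LIBRARY_PATH are left to whatever the surrounding environment already provides.

```python
import os

def pyspark_notebook_env(profile, submit_args, python="python2.7"):
    """Build an environment for starting IPython Notebook through PySpark.

    `profile` and `submit_args` are supplied by the caller; none of the
    values used below are taken from the talk.
    """
    env = dict(os.environ)  # start from the current environment
    env["PYSPARK_PYTHON"] = python
    env["PYSPARK_DRIVER_PYTHON"] = "ipython"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook --profile=%s" % profile
    env["PYSPARK_SUBMIT_ARGS"] = submit_args
    return env

# Hypothetical values, for illustration only
env = pyspark_notebook_env(profile="pyspark", submit_args="--master local[2]")

# To actually start the notebook one could then run, for example:
#   subprocess.call(["pyspark"], env=env)
```

Keeping the settings in one function makes it easy to switch between a local master and a cluster by changing only `submit_args`.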
  • 29. IPYTHON/JUPYTER KERNELS • IPython • IGo • Bash • IR • IHaskell • IMatlab • ICSharp • IScala • IRuby • IJulia .. and more https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
  • 31. Apache Zeppelin (incubating) is an interactive data analytics environment for distributed data processing systems. It provides a beautiful interactive web-based interface, data visualization, a collaborative work environment and many other nice features to make your data analytics more fun and enjoyable. Zeppelin has been incubating since Dec 2014.
 https://zeppelin.incubator.apache.org/
  • 32.
  • 33. Shell script & calling library package. Load and process data with Spark
  • 34. SQL query powered by Spark SQL - progress & parameterization via dynamic form
  • 35. Python & data passing across languages (interpreters)
  • 36. ZEPPELIN ARCHITECTURE Realtime collaboration - enabled by websocket communications Frontend: AngularJS 
 Backend server: Java 
 Interpreters: Java
 Visualization: NVD3
  • 37. INTERPRETERS • Spark group • Spark (Scala) • PySpark • Spark SQL • Dependency • Markdownjs • Shell • Hive • Coming: jdbc, Tajo, etc.
  • 38. CLUSTERING • Clustering tries to find natural groupings in data. It puts objects into groups in which those within a group are more similar to each other than to those in other groups. • Unsupervised learning
  • 39. K-MEANS • First, given an initial set of k cluster centers, we find which cluster each data point is closest to • Then, we compute the average of each of the new clusters and use the result to update our cluster centers
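The two steps above can be sketched in plain Python as follows. This is a minimal single-machine illustration, not MLlib's implementation; the function names and the sample points are assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest(point, centers):
    """Index of the center nearest to `point`."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans_step(points, centers):
    """One iteration: assign each point, then average each cluster."""
    groups = [[] for _ in centers]
    for p in points:
        groups[closest(p, centers)].append(p)
    new_centers = []
    for old, group in zip(centers, groups):
        if group:
            dim = len(old)
            new_centers.append(
                tuple(sum(p[d] for p in group) / len(group) for d in range(dim))
            )
        else:
            new_centers.append(old)  # an empty cluster keeps its old center
    return new_centers

def kmeans(points, centers, max_iterations=20):
    """Repeat the two steps until the centers stop moving."""
    for _ in range(max_iterations):
        updated = kmeans_step(points, centers)
        if updated == centers:
            break
        centers = updated
    return centers

# Made-up data: two obvious groups
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the initial centers, which is why the choice of initialization matters in practice.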
  • 40.
  • 41. K-MEANS|| IN MLLIB • a parallelized variant of k-means++
 http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf Parameters: • k is the number of desired clusters. • maxIterations is the maximum number of iterations to run. • initializationMode specifies either random initialization or initialization via k-means||. • runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result). • initializationSteps determines the number of steps in the k-means|| algorithm. • epsilon determines the distance threshold within which we consider k-means to have converged.
  • 43. ANOMALY DETECTION WITH K-MEANS Using Spark DataFrame and the csv data source to process KDDCup’99 data
 Scoring with different k values
 Details on github at: http://bit.ly/1JWOPh8
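One common way to use k-means for anomaly detection, sketched below, is to score each point by its distance to the nearest cluster center and to compare the mean of that distance across different values of k. This is an assumption about the general technique, not a description of the linked notebooks; the function names, the threshold and the data are made up.

```python
import math

def distance_to_nearest(point, centers):
    """Euclidean distance from `point` to its closest cluster center."""
    return min(
        math.sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))
        for center in centers
    )

def clustering_score(points, centers):
    """Mean distance to the nearest center; lower means tighter clusters."""
    return sum(distance_to_nearest(p, centers) for p in points) / len(points)

def anomalies(points, centers, threshold):
    """Points that lie farther than `threshold` from every center."""
    return [p for p in points if distance_to_nearest(p, centers) > threshold]

# Made-up centers and points, for illustration only
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (10.0, 9.0), (5.0, 5.0)]
outliers = anomalies(points, centers, threshold=3.0)
score = clustering_score(points, centers)
```

Computing `clustering_score` for models trained with several k values gives a simple basis for choosing k before flagging outliers.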
  • 47. OTHER NOTEBOOKS • Spark-notebook: https://github.com/andypetrella/spark-notebook • ISpark: https://github.com/tribbloid/ISpark • Spark Kernel: https://github.com/ibm-et/spark-kernel • Jove Notebook: https://github.com/jove-sh/jove-notebook • Beaker: https://github.com/twosigma/beaker-notebook • Databricks Cloud notebook
  • 49. WHY STREAMING? • Train - model - predict works well on static data • What if data is • Coming in streams • Changing over time?
  • 50. STREAMING K-MEANS DESIGN • Proposed by Dr Jeremy Freeman (here)
  • 51. STREAMING K-MEANS • key concept: forgetfulness • balances the relative importance of new data versus past history • half-life • time it takes before past data contributes to only one half of the current model
  • 52. STREAMING K-MEANS • time unit • batches (which have a fixed duration in time), or points • eliminate dying clusters
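The forgetfulness idea can be sketched as a weighted average in which the weight of past data is multiplied by a decay factor before each new batch is folded in. The code below is an illustration under stated assumptions: the time unit is batches, the function names are made up, and the elimination of dying clusters is left out.

```python
import math

def decay_from_half_life(half_life):
    """Decay factor a such that a ** half_life == 0.5."""
    return math.exp(math.log(0.5) / half_life)

def update_center(center, weight, batch_mean, batch_count, decay):
    """Blend an existing center with the mean of the new batch's points.

    `weight` is the (already discounted) amount of past data behind
    `center`. Returns the new center and its new weight.
    """
    discounted = weight * decay            # forget part of the history
    new_weight = discounted + batch_count
    if new_weight == 0:
        return center, 0.0                 # nothing known, nothing new
    new_center = tuple(
        (c * discounted + m * batch_count) / new_weight
        for c, m in zip(center, batch_mean)
    )
    return new_center, new_weight

# A half-life of one batch corresponds to a decay factor of 0.5.
decay = decay_from_half_life(1.0)
center, weight = update_center((0.0, 0.0), 10.0, (10.0, 20.0), 5, decay)
```

A decay factor of 1 keeps all history (ordinary running average), while a factor of 0 uses only the most recent batch; the half-life is a more intuitive way to pick a value in between.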

  • 55. VISUALIZING STREAMING K-MEANS ON IPYTHON + LIGHTNING • Lightning - data visualization server
 http://lightning-viz.org • provides API-based access to reproducible, web-based, interactive visualizations. It includes a core set of visualization types, but is built for extendability and customization. Lightning supports modern libraries like d3.js and three.js, and is designed for interactivity over large data sets and continuously updating data streams.
  • 56. RUNNING LIGHTNING • API: node.js, Python, Scala • Extension support for custom charts (e.g. d3.js) • Requirements: • Postgres recommended (SQLite ok) • node.js (npm, gulp)
  • 57. The Freeman Lab at Janelia Research Campus uses Lightning to visualize large-scale neural recordings from zebrafish, in collaboration with the Ahrens Lab
  • 58. SPARK STREAMING K-MEANS DEMO Environment • requires: numpy, scipy, scikit-learn • IPython/Python requires: lightning-python package Demo consists of 3 parts:
 https://github.com/felixcheung/spark-ml-streaming • Python driver script, data generator • Scala job - Spark Streaming & Streaming k-means • IPython notebook to process result, visualize with Lightning
 Originally this was part of the Python driver script - it has been modified for this talk to run within IPython
  • 59.
  • 60.
  • 61. CHALLENGES • Package management • Version/build conflicts!
  • 62. YOU CAN RUN THIS TOO! • Notebooks available at http://bit.ly/1JWOPh8 • Everything is heavily scripted and automated
 Vagrant cong for local, virtual environment available at http://bit.ly/1DB3OLw