CONNECTING PYTHON TO THE
SPARK ECOSYSTEM
Daniel Rodriguez
Software developer/data scientist
Continuum Analytics
Twitter: @danielfrg
Github: github.com/danielfrg
Spark Summit 2016
Content
• PyData and Spark
• Python (PyData libs) package management
• Python package management in a cluster
• Multiple options
• Using single-node Python (PyData) libraries in a cluster
• Future
2
PyData ecosystem
3
Image by Jake VanderPlas
PyData ecosystem: NumPy
4
• NumPy is the core of the ecosystem (short example below):
• Powerful N-dimensional array object
• Broadcasting functions
• Tools for integrating C/C++ and Fortran code
• Linear algebra, Fourier transform, and random number capabilities
• Single node
• Python only
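A minimal sketch (not from the slides) of the NumPy features listed above:

import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # N-dimensional array object
b = a * 2.0 + 1.0                                  # broadcasting: scalar ops apply elementwise
col_means = b.mean(axis=0)                         # fast reductions implemented in C
x = np.linalg.solve(np.eye(3), np.ones(3))         # linear algebra
spectrum = np.fft.fft(np.random.rand(8))           # Fourier transform and random numbers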
Spark Ecosystem
5
Image (c) Cloudera, Inc
Third party: spark-packages.org
Spark ecosystem
6
• Fast and general engine for large-scale data
processing
• High-level libraries:
• DataFrames, MLlib, SQL, Graphs
• Similar in spirit to PyData (quick sketch below)
• Distributed by design: the RDD as its core abstraction
• JVM-based, with multi-language interfaces: Scala, Java, Python, R
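A minimal sketch (not from the slides) of the DataFrame and SQL APIs in PySpark, assuming a Spark 1.6-style SQLContext named sqlContext and a hypothetical people.json file:

df = sqlContext.read.json("people.json")                 # DataFrames
df.filter(df.age > 21).groupBy("age").count().show()

df.registerTempTable("people")                           # SQL over the same data
sqlContext.sql("SELECT age, COUNT(*) AS n FROM people GROUP BY age").show()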
PyData vs Spark
7
• PyData for scientists - who may also use MPI
• Spark for Big Data (the post-MapReduce world)
• Both were relatively fine and happy
• Scientists now have to scale
• Big Data people now have to do data science, statistics, and ML
• Data scientists now do both
PySpark issues
8
• The PySpark source is not very pythonic (imo)
• Package management
• Performance (rough illustration below):
• Serialization and pipes
• RDDs of pickled objects
• Py4J for JVM <-> Python communication
• Java <-> Python round trips on the driver and on each worker
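A rough single-process illustration (not PySpark internals) of the cost being described: closures and records have to be serialized to cross the JVM <-> Python boundary for every RDD operation written in Python.

import pickle

def line_length(line):
    return len(line)

payload = pickle.dumps(line_length)               # PySpark actually uses cloudpickle for closures
records = pickle.dumps(["a line", "another line"])

# ...bytes travel over Py4J and the worker pipes, then get deserialized on the other side...
func = pickle.loads(payload)
print([func(x) for x in pickle.loads(records)])   # [6, 12]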
PyData package management
9
• NumPy and others include C/C++ and Fortran code
• Compiling it yourself is painful, especially on Windows
Conda: github.com/conda/conda
Conda-forge: community-powered packages
Package management in a cluster
10
• Lots of nodes (100s to 1000s)
• Not really a new problem for DevOps or configuration management (CM) tools
Search GitHub for the conda modules available for these tools
Cloudera Manager and Anaconda Parcel
11
• Cloudera CDH: the most popular Hadoop distribution
• Anaconda parcel for CDH *:
• Static Python distribution
• Super easy installation if you already have CDH
*: Joint work between Cloudera and Continuum
More info:
- https://docs.continuum.io/anaconda/cloudera
- http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh
I don't use Cloudera
12
You manage your own Hadoop cluster?
For real?
Want to make this happen?
Call me maybe?
Anaconda for cluster management
13
• Agnostic: cloud, on-premise, air-gapped
• Dynamic package and environment management
• Based on Salt
• Extra plugins: Jupyter Notebook and more
More info: https://docs.continuum.io/anaconda-cluster/index
Leverage Spark and YARN
14
• New ways to deploy environments:
• Sparkonda: https://github.com/moutai/sparkonda
Leverage Spark: Sparkonda
15
# On the client/edge node (shell): create the env and install sparkonda
conda create -n sparkonda-test-env python=2.7 pip pandas scikit-learn numpy numba
source activate sparkonda-test-env
pip install sparkonda

# In the PySpark session (Python):
sc.addPyFile('path/to/sparkonda_utils.py')
import sparkonda_utils as skon   # the helper module shipped on the line above

skon.CONDA_ENV_NAME = 'sparkonda-test-env'
# home_dir: wherever Miniconda lives, e.g. os.path.expanduser('~')
skon.CONDA_ENV_LOCATION = ''.join([home_dir, '/miniconda/envs/', skon.CONDA_ENV_NAME])
skon.SC_NUM_EXECUTORS = 2
skon.SC_NUM_CORE_PER_EXECUTOR = 2

skon.pack_conda_env()                    # tar up the local conda env
skon.distribute_conda_env(sc)            # ship the archive to the workers
skon.list_cwd_files(sc)                  # sanity check: what landed on each worker
skon.install_conda_env(sc)               # unpack the env on every worker
skon.set_workers_python_interpreter(sc)  # point the Python workers at the new env

def check_pandas(x):
    import pandas as pd
    return [pd.__version__]

skon.prun(sc, check_pandas, include_broadcast_vars=False)  # run on the workers to verify
More info: https://github.com/moutai/sparkonda
Leverage Spark and YARN
16
• Beta
• Conda + Spark: http://quasiben.github.io/blog/2016/4/15/conda-spark/
• Security: Kerberos
PySpark
17
lines = sc.textFile("data.txt")                       # RDD of strings, one per line
lineLengths = lines.map(lambda s: len(s))             # transformation: length of each line
totalLength = lineLengths.reduce(lambda a, b: a + b)  # action: sum across the cluster
PySpark: NLTK
18
def word_tokenize(x):
    import nltk                   # imported inside the function so it runs on the workers
    return nltk.word_tokenize(x)

def pos_tag(x):
    import nltk
    return nltk.pos_tag([x])

words = data.flatMap(word_tokenize)   # `data` is an RDD of lines of text
pos_word = words.map(pos_tag)
pos_word.take(5)
[[('Address', 'NN')],
[('on', 'IN')],
[('the', 'DT')],
[('State', 'NNP')],
[('of', 'IN')]]
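One practical detail the snippet assumes: the NLTK models behind word_tokenize and pos_tag have to be present on every worker. A hedged sketch (assuming the workers can reach the internet) that triggers the downloads across the cluster before running the jobs:

def ensure_nltk_data(_):
    import nltk
    nltk.download('punkt')                        # model used by nltk.word_tokenize
    nltk.download('averaged_perceptron_tagger')   # model used by nltk.pos_tag
    return ['ok']

# best effort: run enough tasks that every worker downloads the data at least once
sc.parallelize(range(100), 100).flatMap(ensure_nltk_data).count()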
PySpark: Image processing GPUs
19
Jupyter Notebook:
https://gist.github.com/danielfrg/0afee63072793e9e9d6ebaee27865a4a
PySpark: Scikit-learn: spark-sklearn
20
# Plain scikit-learn: the whole grid search runs on a single node
from sklearn import grid_search, datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}
gs = grid_search.GridSearchCV(RandomForestClassifier(),
                              param_grid=param_grid)
gs.fit(X, y)
More info: https://github.com/databricks/spark-sklearn
PySpark: Scikit-learn: spark-sklearn
21
More info: https://github.com/databricks/spark-sklearn
from sklearn import grid_search, datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV   # drop-in replacement that distributes the search over Spark

digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}
# Same scikit-learn API, plus the SparkContext as the first argument
gs = GridSearchCV(sc, RandomForestClassifier(),
                  param_grid=param_grid)
gs.fit(X, y)
PySpark: Dataframes
22
Spark DataFrames keep PySpark performance happy: the query plan executes inside the JVM,
so rows avoid the Python pickle/Py4J round trip (rough comparison below)
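A minimal sketch (not from the slides) of that contrast, assuming a Spark 1.6-style SQLContext named sqlContext and a hypothetical events.json file with a numeric "value" field:

import json

# RDD + Python lambdas: every record is pickled and processed in Python worker processes
total_rdd = sc.textFile("events.json") \
              .map(lambda line: json.loads(line)["value"]) \
              .sum()

# DataFrame: the same aggregation is planned by Catalyst and executed inside the JVM,
# so the rows never cross the Py4J / pickle boundary
df = sqlContext.read.json("events.json")
total_df = df.groupBy().sum("value").collect()[0][0]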
Future / Alternatives
23
• Apache Arrow
• TensorFlow
• Dask
Apache Arrow
24
More info:
- https://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/
TensorFlow
25
Dask
26
# pandas (single node)
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# dask.dataframe (parallel, same API)
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()
Python parallel computing library for analytics
pip install dec2   # or: conda install -c conda-forge dec2

# Provision nine nodes with eight separate worker processes per node
dec2 up --keyname YOUR-AWS-KEY-NAME \
        --keypair ~/.ssh/YOUR-AWS-KEY-FILE.pem \
        --count 9 \
        --nprocs 8
More info: http://dask.pydata.org/en/latest/
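A hedged sketch (not on the slide) of what comes after provisioning: connect dask.dataframe to the distributed scheduler running on the cluster; the scheduler address below is a placeholder.

from dask.distributed import Client
import dask.dataframe as dd

client = Client('SCHEDULER-ADDRESS:8786')   # placeholder; 8786 is the default scheduler port
df = dd.read_csv('2015-*-*.csv')            # partitions are spread across the workers
print(df.groupby(df.user_id).value.mean().compute())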
CONNECTING PYTHON TO THE
SPARK ECOSYSTEM
Questions?
Spark Summit 2016
Daniel Rodriguez
Software developer/data scientist
Continuum Analytics
Twitter: @danielfrg
Github: github.com/danielfrg
