2. Who am I?
Joseph K. Bradley
Ph.D. in Machine Learning from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.
3. Spark
Distributed computing engine
• Built for speed, ease of use, and sophisticated analytics
• Apache open source
Components: SparkSQL, Streaming, MLlib, GraphX
Concise APIs in Python, Java, Scala
… and R in Spark 1.4!
500+ enterprises using or planning to use Spark in production (blog)
5. Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
Machine Learning Pipelines
Simple construction and tuning of ML workflows
7. DataFrames
dept  age  name
Bio   48   H Smith
CS    54   A Turing
Bio   43   B Jones
Chem  61   M Kennedy
RDD API
DataFrame API
Data grouped into
named columns
8. DataFrames
DSL for common tasks
• Project, filter, aggregate, join, …
• Metadata
• UDFs
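To make the "project, filter, aggregate" vocabulary concrete, here is a minimal plain-Python sketch (no Spark required) over the four example rows from the slide. The helper name `avg_age_by_dept` is hypothetical, and a list of dicts stands in for a DataFrame. In PySpark the same aggregate is typically written as `df.groupBy("dept").avg("age")`.

```python
from collections import defaultdict

# Rows copied from the example table on the slide.
rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 54, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]


def avg_age_by_dept(rows):
    """Aggregate: average age per department (hypothetical helper)."""
    totals = defaultdict(lambda: [0, 0])  # dept -> [sum of ages, row count]
    for row in rows:
        totals[row["dept"]][0] += row["age"]
        totals[row["dept"]][1] += 1
    return {dept: total / count for dept, (total, count) in totals.items()}


# Project: keep only the columns the query needs.
projected = [{"dept": r["dept"], "age": r["age"]} for r in rows]

# Filter: keep rows that match a predicate.
over_45 = [r for r in projected if r["age"] > 45]

print(avg_age_by_dept(rows))  # {'Bio': 45.5, 'CS': 54.0, 'Chem': 61.0}
print(len(over_45))           # 3
```

The point of the DSL is that the named-column version states what to compute and leaves the how to the engine.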
9. Spark DataFrames
API inspired by R and Python Pandas
• Python, Scala, Java (+ R in dev)
• Pandas integration
Distributed DataFrame
Highly optimized
10. Spark DataFrames are fast
[Bar chart: runtime of aggregating 10 million int pairs (secs), axis from 0 to 10, with a marker for the “better” direction. Bars: RDD Scala, RDD Python, Spark Scala DF, Spark Python DF.]
Uses SparkSQL Catalyst optimizer
18. Spark for Data Science
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
Simple construction and tuning of ML workflows
19. About Spark MLlib
Started @ Berkeley
• Spark 0.8
Now (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Growing coverage of distributed algorithms
20. About Spark MLlib
Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Frequent itemsets
• FP-growth
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
  • Block-partitioned matrix
  • Row matrix
  • Indexed row matrix
  • Coordinate matrix
• Matrix decompositions
Model import/export
Pipelines
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
• One-Hot Encoder
• StringIndexer
• VectorIndexer
• VectorAssembler
• Binarizer
• Bucketizer
• ElementwiseProduct
• PolynomialExpansion
List based on upcoming release 1.4
21. ML Workflows are complex
Load data → Extract features → Train model → Evaluate
22. ML Workflows are complex
Train model
Evaluate
Datasource 1
Extract features
Datasource 2
Datasource 2
23. ML Workflows are complex
Train model
Evaluate
Datasource 1
Datasource 2
Datasource 2
Extract features
Extract features
Feature transform 1
Feature transform 2
Feature transform 3
24. ML Workflows are complex
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 2
Extract features
Extract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
25. ML Workflows are complex
Specify pipeline
Inspect & debug
Re-run on new data
Tune parameters
26. Example: Text Classification
Goal: Given a text document, predict its topic.
Features:
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something...
Label:
1: about science
0: not about science
Dataset: “20 Newsgroups”
From UCI KDD Archive
31. Extract Features
Workflow: Load data → Extract features (Tokenizer → Hashed Term Freq.) → Train model → Evaluate
Current data schema:
label: Int
text: String
words: Seq[String]
features: Vector
Transformer: DataFrame → DataFrame
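The two feature-extraction steps can be sketched in plain Python to show what each new column holds. This is an illustration under stated assumptions, not Spark's implementation: the function names, the whitespace tokenizer, the 16-bucket vector size, and the use of `zlib.crc32` as the hash are all choices made for this sketch.

```python
import zlib


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def hashed_term_freq(words, num_features=16):
    """Hash each word to a bucket and count occurrences per bucket.

    zlib.crc32 is used only because it is deterministic across runs;
    it is an assumption of this sketch, not the hash Spark uses.
    """
    vector = [0.0] * num_features
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


# One row with the starting schema: label and text.
row = {"label": 1, "text": "Suggest McQuires #1 plastic polish"}
row["words"] = tokenize(row["text"])              # text -> words
row["features"] = hashed_term_freq(row["words"])  # words -> features

print(row["words"])          # ['suggest', 'mcquires', '#1', 'plastic', 'polish']
print(sum(row["features"]))  # 5.0, one count per token
```

Hashing keeps the feature vector at a fixed length without building a vocabulary first, at the cost of occasional collisions between words that land in the same bucket.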
32. Train a Model
Workflow: Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Current data schema:
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
Estimator: DataFrame → Model
33. Evaluate the Model
Workflow: Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Current data schema:
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
Evaluator: DataFrame → metric
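An evaluator reduces a scored dataset to a single number. The slide does not say which metric is used, so the sketch below assumes plain accuracy; the function name and the list-of-dicts stand-in for a DataFrame are likewise assumptions.

```python
def accuracy(rows, label_col="label", prediction_col="prediction"):
    """Fraction of rows whose prediction equals the label."""
    if not rows:
        raise ValueError("cannot evaluate an empty dataset")
    correct = sum(1 for r in rows if r[label_col] == r[prediction_col])
    return correct / len(rows)


# Scored rows: each carries the true label and the model's prediction.
scored = [
    {"label": 1, "prediction": 1},
    {"label": 0, "prediction": 1},
    {"label": 0, "prediction": 0},
    {"label": 1, "prediction": 1},
]

print(accuracy(scored))  # 0.75
```

Because the evaluator reads columns by name, it works on any dataset that carries a label column and a prediction column, whatever stages produced them.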
34. Data Flow
Workflow: Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate
Current data schema:
label: Int
text: String
words: Seq[String]
features: Vector
prediction: Int
By default, always append new columns
Can go back & inspect intermediate results
Made efficient by DataFrames
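The append-only data flow can be sketched with a toy in-memory pipeline. The class names echo the slide's terms (Transformer, Estimator, Model, Pipeline), but everything here is an assumption made for illustration: rows are plain dicts, and the "learning" rule is a keyword lookup rather than logistic regression.

```python
class Tokenizer:
    """Transformer: returns rows with a new 'words' column."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class KeywordModel:
    """Model (also a Transformer): returns rows with a 'prediction' column."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            dict(r, prediction=int(any(w in self.keywords for w in r["words"])))
            for r in rows
        ]


class KeywordEstimator:
    """Estimator: fit() learns from rows and returns a Model.

    Toy rule, assumed for this sketch only: keep words seen in positive
    examples and never in negative ones.
    """

    def fit(self, rows):
        pos = {w for r in rows if r["label"] == 1 for w in r["words"]}
        neg = {w for r in rows if r["label"] == 0 for w in r["words"]}
        return KeywordModel(pos - neg)


class PipelineModel:
    """A fitted pipeline: every stage is a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order; fitting swaps each Estimator for its Model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> Model
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"label": 1, "text": "plastic polish removes scratches"},
    {"label": 0, "text": "the game went to overtime"},
]
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(train)
out = model.transform([{"text": "Which polish works on plastic"}])

print(sorted(out[0]))        # ['prediction', 'text', 'words']
print(out[0]["prediction"])  # 1
```

Each stage adds a column and leaves the earlier ones in place, so the intermediate `words` column is still there to inspect after prediction.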
47. Recap
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning
Composable & DAG Pipelines
Schema validation
User-defined Pipeline components
48. Looking Ahead
Spark 1.4
• SparkR
• Pipelines graduating from alpha
• Many more feature transformers
• More complete Python API
Future
• API for R DataFrames & Pipelines
• More ML algorithms & pluggability
• Improved model inspection
Learn more next week at the Spark Summit!
spark-summit.org/2015
49. Databricks Inc.
Founded by the creators of Spark & driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers
50. Thank you!
Spark documentation
spark.apache.org
Pipelines blog post
databricks.com/blog/2015/01/07
DataFrames blog post
databricks.com/blog/2015/02/17
Databricks Cloud Platform
databricks.com/product
Spark MOOCs on edX
Intro to Spark & ML with Spark
Spark Packages
spark-packages.org
For those coming from Hadoop, this is a huge improvement: simpler code, runs on a laptop and on a huge cluster, very efficient.
Can you spot the bug in the code using the RDD API?
Contributions estimated from GitHub commit logs, with some effort to de-duplicate entities.
Dataset source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
*Data from UCI KDD Archive, originally donated to archive by Tom Mitchell (CMU).
TODO: Include schema validation in the demo? (Select wrong columns to pass to Pipeline.fit().)
No time to mention:
• User-defined functions (UDFs)
• Optimizations: code gen, predicate pushdown