In the world of Data Science, Python and R are very popular. Apache Spark is a highly scalable data platform. How can a Data Scientist integrate Spark into their existing Data Science toolset? How does Python work with Spark? How can one leverage the 10,000+ packages on CRAN for R?
We will start with PySpark, beginning with a quick walkthrough of data preparation practices and an introduction to the Spark MLlib Pipeline model. We will also discuss how to integrate native Python packages with Spark.
Compared to PySpark, SparkR is a newer language binding for Apache Spark, designed to be familiar to native R users. In this talk we will walk through many examples of how new features in Apache Spark 2.x enable scalable machine learning on Big Data. In addition to covering the R interface to the ML Pipeline model, we will explore how SparkR supports running user code on large-scale data in a distributed manner, and give examples of how that can be used to work with your favorite R packages.
Tags: Python, R, Apache Spark, ML, DL
3. Agenda
• Intro to Spark, PySpark, SparkR
• ML with Spark
• Data Science: PySpark
  – ML Pipeline API
  – Integrating with packages (tensorframes, BigDL, …)
• Data Science: SparkR
  – ML API
  – Integrating with packages (svm, randomForest, tensorflow, …)
11. SparkR
• R language APIs for Spark
• Exposes Spark functionality through an R, dplyr-like API
• Runs as its own REPL, sparkR
• or as an R package loaded in IDEs like RStudio

library(SparkR)
sparkR.session()
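For example, once the session is started, a local data.frame can be lifted into a distributed DataFrame and queried with dplyr-like verbs (a minimal sketch using R's built-in faithful dataset):

df <- createDataFrame(faithful)    # local data.frame -> distributed DataFrame
head(filter(df, df$waiting < 50))  # dplyr-like verbs: filter, select, groupBy, ...
head(select(df, "eruptions"))
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))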
12. Architecture
• Native R classes and methods
• RBackend
• Scala wrapper/helper (e.g. ML Pipeline)
www.slideshare.net/SparkSummit/07-venkataraman-sun
13. High Performance
• JVM processing, full access to DAG capabilities and the Catalyst optimizer: predicate pushdown, code generation, etc.
databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
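Because SparkR queries compile to the same Catalyst plans as Scala or Python code, the optimized plan can be inspected directly from R. A quick sketch (assuming an active session and the built-in mtcars data):

df <- createDataFrame(mtcars)
# explain() prints the physical plan produced by the Catalyst optimizer
explain(filter(df, df$cyl > 4))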
24. Models
Classification:
• Logistic regression (binomial and multinomial)
• Decision tree classifier
• Random forest classifier
• Gradient-boosted tree classifier
• Multilayer perceptron classifier
• One-vs-Rest classifier (a.k.a. One-vs-All)
• Naive Bayes
Clustering:
• K-means
• Latent Dirichlet allocation (LDA)
• Bisecting k-means
• Gaussian mixture model (GMM)
Collaborative Filtering:
• Alternating least squares (ALS)
Regression:
• Linear regression
• Generalized linear regression
• Decision tree regression
• Random forest regression
• Gradient-boosted tree regression
• Survival regression
• Isotonic regression
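Most of these are one-line calls in SparkR (and similarly concise in PySpark). As a minimal sketch, clustering the iris data with one of the models above (createDataFrame converts dots in column names to underscores):

irisDF <- createDataFrame(iris)
model <- spark.kmeans(irisDF, ~ Petal_Length + Petal_Width, k = 3)
head(predict(model, irisDF))   # cluster assignment per row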
25. PySpark ML Pipeline Model

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)
prediction = model.transform(test)
26. Large Scale, Distributed
• 2012 paper "Large Scale Distributed Deep Networks", Jeff Dean et al.
• Model parallelism
• Data parallelism
https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf
27. Spark as a Scheduler
• Run external (often native) packages
• Data-parallel tasks
• Distributed execution on external data
28. UDF
• Register a function to use in SQL
  spark.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
  spark.sql("SELECT stringLengthInt('test')")
• Function on RDD
  rdd.map(lambda l: str(l.e)[:7])
  rdd.mapPartitions(lambda x: [(1,), (2,), (3,)])
30. Considerations
• Data transfer overhead (serialization/deserialization, network)
• Native language (boxing, interpreted)
• Lambda – one row at a time
36. BigDL
• Distributed model training
  – P2P All Reduce
  – block manager as parameter server
• Integration with Intel's Math Kernel Library (MKL)
• Model snapshots
• Load Caffe/Torch models
39. mmlspark
• Spark ML Pipeline model + rich data types: images, text
• Deep Learning with CNTK
  – Train on GPU edge node
• Transfer Learning

# Initialize CNTKModel and define input and output columns
cntkModel = CNTKModel() \
    .setInputCol("images") \
    .setOutputCol("output") \
    .setModelLocation(modelFile)
# Train on dataset with internal spark pipeline
scoredImages = cntkModel.transform(imagesWithLabels)
40. tensorframes
• DataFrame -> TensorFlow
• Hyperparameter tuning
• Apply trained models at scale
• Row or block operations
• JVM to C++ (bypassing Python)
42. spark-deep-learning
• Spark ML Pipeline model + complex data: images
  – Transformer
• DL featurization
• Transfer Learning
• Model as SQL UDF
• Later: hyperparameter tuning
46. Spark ML Pipeline
• Pre-processing, feature extraction, model fitting, validation stages
• Transformer
• Estimator
• Cross-validation/hyperparameter tuning
47. SparkR API for ML Pipeline

spark.lda(data = text, k = 20, maxIter = 25, optimizer = "em")

[Diagram: the R API call crosses from R to the JVM, which builds an ML Pipeline of RegexTokenizer, StopWordsRemover, CountVectorizer, and LDA stages]
48. Model Operations
• summary – print a summary of the fitted model
• predict – make predictions on new data
• write.ml / read.ml – save/load fitted models (slight layout difference: pipeline model plus R metadata)
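Put together, a fitted model can be summarized, persisted, and reloaded for prediction (a minimal sketch; the save path is arbitrary):

irisDF <- createDataFrame(iris)
model <- spark.glm(irisDF, Sepal_Length ~ Sepal_Width, family = "gaussian")
summary(model)                      # coefficients, deviance, etc.
write.ml(model, "/tmp/glm_model")   # saves the pipeline model + R metadata
model2 <- read.ml("/tmp/glm_model")
head(predict(model2, irisDF))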
49. RFormula
• Specify modeling in symbolic form
  y ~ f0 + f1 means the response y is modeled linearly by f0 and f1
• Supports a subset of R formula operators: ~ , . , : , + , -
• Implemented as a feature transformer in core Spark, available to Scala/Java and Python
• String label column is indexed
• String term columns are one-hot encoded
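For instance, the '.' operator combined with a string column (a sketch on the iris data; Species is a string column, so it is one-hot encoded before fitting):

irisDF <- createDataFrame(iris)
# '.' expands to all remaining columns; Species (string) is one-hot encoded
model <- spark.glm(irisDF, Sepal_Length ~ ., family = "gaussian")
summary(model)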
50. Generalized Linear Model

# R-like
glm(Sepal_Length ~ Sepal_Width + Species, gaussianDF, family = "gaussian")

spark.glm(binomialDF, Species ~ Sepal_Length + Sepal_Width, family = "binomial")

• "binomial" – output string label, prediction
52. Random Forest

spark.randomForest(df, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16)

spark.randomForest(df, Species ~ Petal_Length + Petal_Width, "classification", numTrees = 30)

• "classification" – label is indexed, predicted label mapped back to string
53. Native-R UDF
• User-Defined Functions – custom transformation
• Apply by Partition
• Apply by Group
[Diagram: data.frame → UDF → data.frame]
54. Parallel Processing By Partition
[Diagram: partitions are processed in parallel; each partition is converted to an R data.frame, passed through the UDF in an R worker, and returned as a data.frame]
55. UDF: Apply by Partition
• Similar to R's apply
• Function to process each partition of a DataFrame
• Mapping of Spark/R data types

dapply(carsSubDF,
       function(x) {
         # add a column; output must match 'schema'
         x <- cbind(x, x$mpg * 1.61)
       },
       schema)   # structType describing the output columns
56. UDF: Apply by Partition + Collect
• No schema

out <- dapplyCollect(
  carsSubDF,
  function(x) {
    x <- cbind(x, "kmpg" = x$mpg * 1.61)
  })
57. Example - UDF

results <- dapplyCollect(train,
  function(x) {
    model <- randomForest::randomForest(
      as.factor(dep_delayed_15min) ~ Distance + night + early,
      data = x, importance = TRUE, ntree = 20)
    predictions <- predict(model, t)
    data.frame(UniqueCarrier = t$UniqueCarrier, delayed = predictions)
  })

• closure capture – 't' is serialized & broadcast with the UDF
• the package is accessed via 'randomForest::' at each invocation
58. UDF: Apply by Group
• By grouping columns

gapply(carsDF, "cyl",
  function(key, x) {
    y <- data.frame(key, max(x$mpg))
  },
  schema)
59. UDF: Apply by Group + Collect
• No schema

out <- gapplyCollect(carsDF, "cyl",
  function(key, x) {
    y <- data.frame(key, max(x$mpg))
    names(y) <- c("cyl", "max_mpg")
    y
  })
60. UDF Considerations
• No support for nested structures as columns
• Scaling up / data skew
• Data variety
• Performance costs
• Serialization/deserialization, data transfer
• Package management
61. UDF: lapply
• Like R lapply or doParallel
• Good for "embarrassingly parallel" tasks
• Such as hyperparameter tuning
62. UDF: lapply
• Take a native R list, distribute it
• Run the UDF in parallel
[Diagram: vector/list → UDF applied per element (output can be anything) → list]
63. UDF: parallel distributed processing
• Output is a list – needs to fit in memory at the driver

costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
  model <- e1071::svm(Species ~ ., iris, cost = cost)
  summary(model)
}
summaries <- spark.lapply(costs, train)
64. UDF: model training with tensorflow

train <- function(step) {
  library(tensorflow)
  sess <- tf$InteractiveSession()
  ...
  result <- sess$run(list(merged, train_step),
                     feed_dict = feed_dict(TRUE))
  summary <- result[[1]]
  train_writer$add_summary(summary, step)
  step
}
spark.lapply(1:20000, train)
65. SparkR as a Package (target: 2.2)
• Goal: simple one-line installation of SparkR from CRAN
  install.packages("SparkR")
• Spark jar downloaded from the official release and cached automatically, or manually with install.spark() (since Spark 2)
• R vignettes
• Community can write packages that depend on the SparkR package, e.g. SparkRext
• Advanced Spark JVM interop APIs:
  sparkR.newJObject, sparkR.callJMethod, sparkR.callJStatic
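These interop APIs let R code construct JVM objects and invoke their methods directly. A minimal sketch (the Java classes here are only illustrative):

# call a static method on a JVM class
sparkR.callJStatic("java.lang.Math", "max", 1L, 2L)
# construct a JVM object and invoke an instance method on it
lst <- sparkR.newJObject("java.util.ArrayList")
sparkR.callJMethod(lst, "add", 42L)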
66. Ecosystem
• RStudio sparklyr
• RevoScaleR/RxSpark, R Server
• H2O R
• Apache SystemML (R-like API)
• STC R4ML
• Apache MXNet (incubating)
67. Recap
• Let Spark take care of things, or call into packages
• Be aware of your partitions!
• Many efforts to make that seamless and efficient