R is a popular statistical programming language used for data analysis and machine learning. It has over 3 million users and is taught widely in universities. While powerful, R has some scaling limitations for big data. Several Apache Spark integrations with R like SparkR and sparklyr enable distributed, parallel processing of large datasets using R on Spark clusters. Other options for scaling R include H2O for in-memory analytics, Microsoft ML Server for on-premises scaling, and ScaleR for portable parallel processing across platforms. These solutions allow R programs and models to be trained on large datasets and deployed for operational use on big data in various cloud and on-premises environments.
2. • Introduction to R
• Benefits and challenges
• R in Apache Spark: Distributed computing
• R in Databases: In-DB intelligence
3. • 3+M users
• Taught in most universities
• Thriving user groups worldwide
• 5th in 2016 IEEE Spectrum rank
• ~40% of pro analysts prefer R (highest amongst R, SAS, Python)
• 10,000+ contributed packages
• Many common use cases across industry
• Rich application & platform integration
What is R?
• The most popular statistical & ML programming language
• A data visualization tool
• Open source
Language
Platform
Community
Ecosystem
4. R Adoption is on a tear
76% of analytic professionals use R
36% select R as their primary tool
R Usage Growth
Rexer Data Miner Survey 2007-2015
2016 IEEE Spectrum rank
9. Data processing and modeling with SparkR
MLlib: Apache Spark's scalable machine learning library
10. sparklyr: R interface for Apache Spark
Source: http://spark.rstudio.com/
• Easy installation from CRAN
• Loads data into SparkDataFrame from: local R data frames, Hive tables, CSV, JSON, and Parquet files.
• Connects to both local instances of Spark and remote Spark clusters
11. dplyr and ML in sparklyr
• Includes three families of functions for machine learning pipelines
  • ml_*: Machine learning algorithms, provided by the spark.ml package, for analyzing data.
    • K-Means, GLM, LR, Survival Regression, DT, RF, GBT, PCA, Naive-Bayes, Multilayer Perceptron, LDA
  • ft_*: Feature transformers for manipulating individual features.
  • sdf_*: Functions for manipulating SparkDataFrames.
• Provides a complete dplyr backend for data manipulation and analysis
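The dplyr backend and the three function families can be sketched in one short session. This is a minimal sketch, assuming the sparklyr and dplyr packages are installed and a local Spark installation is available (for example via spark_install()). The mtcars data, the table name, the 150 hp threshold and the model formula are illustrative choices, not taken from the slides.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation exists (e.g. from sparklyr::spark_install()).
sc <- spark_connect(master = "local")

# Copy a small local R data frame into Spark (illustrative data).
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in Spark;
# collect() brings the small result back into R.
by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) %>%
  collect()

# ft_*: feature transformer (flag cars above an illustrative horsepower threshold).
cars_ft <- ft_binarizer(cars_tbl, input_col = "hp",
                        output_col = "high_hp", threshold = 150)

# sdf_*: split the Spark DataFrame into training and test partitions.
parts <- sdf_random_split(cars_ft, training = 0.7, test = 0.3, seed = 42)

# ml_*: fit a linear regression in Spark and score the test partition.
fit <- ml_linear_regression(parts$training, mpg ~ wt + cyl)
pred <- ml_predict(fit, parts$test) %>% collect()

spark_disconnect(sc)
```

Only the collected results (by_cyl and pred) live in the R session; everything before collect() stays in Spark.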
12. h2o: prediction engine in R
http://www.h2o.ai/product/
• Open source ML platform
• Optimized for "in memory" distributed, parallel ML
• Data manipulation and modeling on H2OFrame: R functions + h2o-prefixed functions.
  • Transformations: h2o.group_by(), h2o.impute()
  • Statistics: h2o.summary(), h2o.quantile(), h2o.mean()
  • Algorithms: h2o.glm(), h2o.naiveBayes(), h2o.deeplearning(), h2o.kmeans(), ...
• rsparkling package: h2o on Spark
  • Provides bindings to h2o's machine learning algorithms: extension package for sparklyr
  • Simple data conversion: SparkDataFrame -> H2OFrame
https://github.com/h2oai/rsparkling
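The H2OFrame workflow on this slide might look like the following. It is a minimal sketch, assuming the h2o package is installed and Java is available so that h2o.init() can start a local instance. The iris data and the choice of predictors are illustrative, and the rsparkling conversion from a SparkDataFrame is not shown.

```r
library(h2o)

# Assumption: Java is available so h2o.init() can start a local H2O instance.
h2o.init()

# Convert a local R data frame to an H2OFrame (illustrative data).
iris_hf <- as.h2o(iris)

# Statistics are computed inside H2O, not in the R session.
mean_sepal <- h2o.mean(iris_hf$Sepal.Length)
quartiles <- h2o.quantile(iris_hf$Sepal.Length, probs = c(0.25, 0.5, 0.75))

# Algorithm: Gaussian GLM predicting sepal length from the other measurements.
fit <- h2o.glm(
  x = c("Sepal.Width", "Petal.Length", "Petal.Width"),
  y = "Sepal.Length",
  training_frame = iris_hf,
  family = "gaussian"
)

# Score the frame and bring the predictions back as an R data frame.
pred <- as.data.frame(h2o.predict(fit, iris_hf))

h2o.shutdown(prompt = FALSE)
```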
13. ML Server 9.x: Scale-out R
• 100% compatible with open source R
  • Virtually any code/package that works today with R will work in ML Server.
• Ability to parallelize any R function
  • Ideal for parameter sweeps, simulation, scoring.
• Wide range of scalable and distributed rx-prefixed functions in the RevoScaleR package.
  • Transformations: rxDataStep()
  • Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs(), ...
  • Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest(), ...
  • Parallelism: rxSetComputeContext()
15. ScaleR library: parallel and portable for Big Data
Stream data into blocks from sources: Hive tables, CSV, Parquet, XDF, ODBC and SQL Server.
ScaleR algorithms work inside multiple cores / nodes in parallel at high speed
Interim results are collected and combined analytically to produce the output on the entire data set
XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
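The idea of processing blocks independently and combining interim results analytically can be illustrated in base R. The sketch below is not the RevoScaleR implementation. It assumes a linear regression, where each block's interim result is its pair of cross-products X'X and X'y. The simulated data, block size and variable names are hypothetical.

```r
# Sketch of block-wise processing in base R (not the RevoScaleR implementation).
set.seed(1)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$y <- 2 + 3 * dat$x1 - 1.5 * dat$x2 + rnorm(n, sd = 0.1)

# Split row indices into blocks, standing in for chunks streamed from disk.
block_size <- 1000
blocks <- split(seq_len(n), ceiling(seq_len(n) / block_size))

# Interim result for one block: the cross-products X'X and X'y.
block_stats <- function(rows) {
  X <- cbind(1, dat$x1[rows], dat$x2[rows])
  list(xtx = crossprod(X), xty = crossprod(X, dat$y[rows]))
}

# Each block could be handled by a different core or node;
# lapply keeps this sketch serial.
interim <- lapply(blocks, block_stats)

# Combine interim results by summation, then solve the normal equations once.
xtx <- Reduce(`+`, lapply(interim, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(interim, `[[`, "xty"))
beta_blocks <- drop(solve(xtx, xty))

# Reference fit on the full data set for comparison.
beta_full <- unname(coef(lm(y ~ x1 + x2, data = dat)))
```

Because the cross-products are additive across rows, the block-wise coefficients agree with the full-data fit up to numerical precision, and no single step needs the whole data set in memory.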
16. Write once - deploy anywhere (WODA)
ScaleR: Portable across multiple platforms (local, Spark, SQL Server, etc.)
Models can be trained in one and deployed in another
### SETUP SPARK/HADOOP ENVIRONMENT VARIABLES ###
mySparkCC <- RxSpark()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(mySparkCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = hdfsFS)
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary( ~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = linuxFS)
Local parallel processing: Linux or Windows
In Spark
Compute context R script: sets where the model will run
Functional model R script: does not need to change to run in Spark
17. Spark clusters in Azure HDInsight
• Provisions Azure compute resources with Spark 2.1 installed and configured.
• Supports multiple versions (e.g. Spark 1.6).
• Stores data in Azure Blob storage (WASB), Azure Data Lake Store or local HDFS.
18. ML Server Spark cluster architecture
Diagram labels: Master R process on Edge Node; Apache YARN and Spark; Worker R processes on Data Nodes; ML Server; Data in Distributed Storage
19. Model deployment using ML Server operationalization services (mrsdeploy)
Data Scientist
Developer
Easy Integration
Easy Deployment
Easy Setup
• In-cloud or on-prem
• Adding nodes to scale
• High availability & load balancing
• Remote execution server
Microsoft ML Server configured for operationalizing R analytics
Microsoft R Client (mrsdeploy package)
Easy Consumption
publishService
24. ML Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data
[Chart: Logistic Regression on NYC Taxi Dataset (2.2 TB); axes: elapsed time and billions of rows (0 to 13)]
25. Base and scalable approaches comparison
Approach     Scalability                      Spark  Hadoop  SQL Server  Teradata  Support
CRAN R [1]   Single machines                  -      -       -           -         Community
SparkR       Single + distributed computing   X      -       -           -         Community
sparklyr     Single + distributed computing   X      -       -           -         Community
h2o          Single + distributed computing   X      X       -           -         Community
RevoScaleR   Single + distributed computing   X      X       X           X         Enterprise
[1] CRAN R indicates no additional R packages installed
tinyurl.com/Strata2017R
https://aka.ms/kdd2017r
30. For Oracle In-DB analytics, see: https://www.oracle.com/database/advanced-analytics/index.html
31. In-database machine learning
Develop: Develop, explore and experiment in your favorite IDE
Train: Train models with sp_execute_external_script and save the models in database
Deploy: Deploy your ML scripts with sp_execute_external_script and predict using the models
Consume: Make your app/reports intelligent by consuming predictions
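The R side of the train, save and deploy steps can be sketched in base R. This is a minimal sketch under one assumption: the model is saved as serialized bytes, a common way to keep an R model in a binary database column. The surrounding sp_execute_external_script call and the table that would hold the bytes are not shown, and the data, formula and variable names are illustrative.

```r
# Train: fit a model (illustrative data and formula).
model <- lm(mpg ~ wt + hp, data = mtcars)

# Save: serialize the model to a raw vector, the kind of binary value
# that could be stored in a database column.
model_blob <- serialize(model, connection = NULL)

# Deploy: restore the model from the stored bytes and predict on new rows.
restored <- unserialize(model_blob)
new_rows <- data.frame(wt = c(2.5, 3.2), hp = c(100, 150))
predictions <- predict(restored, newdata = new_rows)
```

Keeping training and scoring as separate scripts that share only the serialized model is what lets the scoring step run next to the data without retraining.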
32. Eliminate data movement
Operationalize ML scripts and models
Enterprise grade performance and scale
SQL Transformations
Relational data
Analytics library