TDWI - Accelerate
October 16, 2:30 – 3:15 PM EDT
Hyatt Regency, Bellevue
• Introduction to R
• Benefits and challenges
• R in Apache Spark: Distributed computing
• R in Databases: In-DB intelligence
What is R?
Language and platform
• The most popular statistical & ML programming language
• A data visualization tool
• Open source
Community
• 3+M users
• Taught in most universities
• Thriving user groups worldwide
• 5th in 2016 IEEE Spectrum rank
• ~40% of professional analysts prefer R (highest among R, SAS, and Python)
Ecosystem
• 10,000+ contributed packages
• Many common use cases across industry
• Rich application & platform integration
R adoption is on a tear
• 76% of analytic professionals use R
• 36% select R as their primary tool
[Charts: R usage growth, Rexer Data Miner Survey 2007-2015; 2016 IEEE Spectrum rank]
• In-memory operation
• Lack of implicit parallelism
• Expensive data movement & duplication
Scaling R on Spark clusters
• What is Spark?
• A unified, open source, parallel data processing framework for Big Data analytics
SparkR: R API included with Apache Spark
Data processing and modeling with SparkR
MLlib: Apache Spark's scalable machine learning library
sparklyr: R interface for Apache Spark
Source: http://spark.rstudio.com/
• Easy installation from CRAN
• Loads data into a SparkDataFrame from local R data frames, Hive tables, CSV, JSON, and Parquet files
• Connects to both local instances of Spark and remote Spark clusters
dplyr and ML in sparklyr
• Includes three families of functions for machine learning pipelines:
• ml_*: Machine learning algorithms, provided by the spark.ml package, for analyzing data
• K-Means, GLM, LR, Survival Regression, DT, RF, GBT, PCA, Naive Bayes, Multilayer Perceptron, LDA
• ft_*: Feature transformers for manipulating individual features
• sdf_*: Functions for manipulating SparkDataFrames
• Provides a complete dplyr backend for data manipulation and analysis, including the %>% pipe
h2o: prediction engine in R
http://www.h2o.ai/product/
• Open source ML platform
• Optimized for “in memory” distributed, parallel ML
• Data manipulation and modeling on H2OFrame: R functions + h2o-prefixed functions
• Transformations: h2o.group_by(), h2o.impute()
• Statistics: h2o.summary(), h2o.quantile(), h2o.mean()
• Algorithms: h2o.glm(), h2o.naiveBayes(), h2o.deeplearning(), h2o.kmeans(), ...
• rsparkling package: h2o on Spark (https://github.com/h2oai/rsparkling)
• Extension package for sparklyr that provides bindings to h2o’s machine learning algorithms
• Simple data conversion: SparkDataFrame -> H2OFrame
ML Server 9.x: Scale-out R
• 100% compatible with open source R
• Virtually any code/package that works today with R will work in ML Server.
• Ability to parallelize any R function
• Ideal for parameter sweeps, simulation, scoring.
• Wide range of scalable and distributed rx-prefixed functions in the RevoScaleR package.
• Transformations: rxDataStep()
• Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()…
• Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()…
• Parallelism: rxSetComputeContext()
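The "parallelize any R function" point, applied to a parameter sweep, can be sketched with base R's parallel package. This is a stand-in for the idea, not the RevoScaleR API: the fit_degree helper, the polynomial-degree sweep, and the built-in mtcars data are all illustrative assumptions.

```r
library(parallel)

# Hypothetical sweep: fit polynomial models of increasing degree on a
# built-in dataset and record each model's residual standard error.
fit_degree <- function(degree) {
  fit <- lm(mpg ~ poly(hp, degree), data = mtcars)
  data.frame(degree = degree, sigma = summary(fit)$sigma)
}

cl <- makeCluster(2)                       # two local worker processes
results <- parLapply(cl, 1:4, fit_degree)  # one parameter value per task
stopCluster(cl)

# Combine the per-task results into a single table
sweep <- do.call(rbind, results)
print(sweep)
```

Each parameter value is an independent task, which is why sweeps, simulations and scoring are natural fits for this pattern.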
Free Developer’s version available
https://aka.ms/freemrs
ScaleR library: parallel and portable for Big Data
• Streams data in blocks from sources: Hive tables, CSV, Parquet, XDF, ODBC and SQL Server.
• ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
• Interim results are collected and combined analytically to produce the output on the entire data set.
• The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
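The block-wise pattern described above (process data in blocks, then combine interim results analytically) can be illustrated in base R. This is only a sketch of the general idea, not how ScaleR is implemented: the simulated data, the block size of 10,000 and the choice of mean and variance as the statistics are assumptions.

```r
# Simulated data standing in for a source too large to process in one pass
set.seed(42)
x <- rnorm(1e5, mean = 10, sd = 2)

# Split into blocks of 10,000 values (hypothetical block size)
blocks <- split(x, ceiling(seq_along(x) / 10000))

# Per-block interim results: count, sum and sum of squares
partial <- lapply(blocks, function(b) {
  c(n = length(b), sum = sum(b), sumsq = sum(b^2))
})

# Combine the interim results by element-wise addition
total <- Reduce(`+`, partial)

# Recover statistics for the entire data set from the combined totals
mean_all <- total[["sum"]] / total[["n"]]
var_all  <- (total[["sumsq"]] - total[["n"]] * mean_all^2) / (total[["n"]] - 1)
```

Because each block contributes only a few numbers, the blocks could be processed on different cores or nodes and only the small interim results moved.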
Write once - deploy anywhere (WODA)
ScaleR: Portable across multiple platforms: local, Spark, SQL Server, etc.
Models can be trained on one platform and deployed on another.

Compute context R script (sets where the model will run)

In Spark:
### SETUP SPARK/HADOOP ENVIRONMENT VARIABLES ###
mySparkCC <- RxSpark()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(mySparkCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = hdfsFS)

Local parallel processing (Linux or Windows):
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = linuxFS)

Functional model R script (does not need to change to run in Spark)

### ANALYTICAL PROCESSING ###
### Statistical summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
Spark clusters in Azure HDInsight
• Provisions Azure compute resources with Spark 2.1 installed and configured.
• Supports multiple versions (e.g. Spark 1.6).
• Stores data in Azure Blob storage (WASB), Azure Data Lake Store or local HDFS.
ML Server Spark cluster architecture
[Diagram: ML Server runs a master R process on the edge node; Apache YARN and Spark run worker R processes on the data nodes, over data in distributed storage.]
Model deployment using ML Server operationalization services (mrsdeploy)
[Diagram: a data scientist using Microsoft R Client (mrsdeploy package) calls publishService against a Microsoft ML Server configured for operationalizing R analytics; developers and data scientists then consume the published service.]
• Easy setup: in-cloud or on-prem, adding nodes to scale, high availability & load balancing, remote execution server
• Easy deployment
• Easy integration
• Easy consumption
Typical advanced analytics lifecycle
[Diagram stages: Prepare/Explore, Model, Operationalize]
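scoringFn <- function(newdata) {
  # Load RevoScaleR inside the function so it is available wherever scoring runs
  library(RevoScaleR)
  # Import the incoming records into a data frame
  data <- rxImport(newdata)
  # Score with a previously trained model; 'model' is not defined here and
  # must already exist in the environment where this function is called
  rxPredict(model, data)
}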
ML Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data
[Chart: Logistic regression on the NYC Taxi dataset (2.2 TB); elapsed time vs. billions of rows, 0 to 13]
Base and scalable approaches comparison
Approach   | Scalability                    | Spark | Hadoop | SQL Server | Teradata | Support
CRAN R (1) | Single machines                |       |        |            |          | Community
SparkR     | Single + distributed computing | X     |        |            |          | Community
sparklyr   | Single + distributed computing | X     |        |            |          | Community
h2o        | Single + distributed computing | X     | X      |            |          | Community
RevoScaleR | Single + distributed computing | X     | X      | X          | X        | Enterprise
(1) CRAN R indicates no additional R packages installed
tinyurl.com/Strata2017R
https://aka.ms/kdd2017r
https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/StrataSanJose2017
https://learnanalytics.microsoft.com/
https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/KDD2017MRS
https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview
For Oracle in-DB analytics, see: https://www.oracle.com/database/advanced-analytics/index.html
In-database machine learning
• Develop: Develop, explore and experiment in your favorite IDE
• Train: Train models with sp_execute_external_script and save the models in the database
• Deploy: Deploy your ML scripts with sp_execute_external_script and predict using the models
• Consume: Make your apps/reports intelligent by consuming predictions
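Saving a trained model in a database generally means turning the R object into bytes and restoring it at scoring time. A minimal base R sketch of that round trip follows. It assumes the bytes would be written to and read from a binary column by the R script run through sp_execute_external_script; the database steps themselves are not shown, and the lm model on the built-in mtcars data is purely illustrative.

```r
# Train: fit a small model (illustrative only)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Save: serialize the model to a raw vector, the form that could be
# stored in a binary column of a database table
model_bytes <- serialize(model, connection = NULL)

# Deploy: restore the model from the stored bytes
restored <- unserialize(model_bytes)

# Predict: score new rows with the restored model
new_rows <- mtcars[1:3, c("wt", "hp")]
preds <- predict(restored, newdata = new_rows)
print(preds)
```

Keeping the model as bytes next to the data is what lets scoring happen where the data lives instead of moving the data out to R.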
Eliminate data movement
Operationalize ML scripts and models
Enterprise grade performance and scale
SQL Transformations
Relational data
Analytics library
Free Developer’s versions available
https://aka.ms/sqlserverdeveloper
R services in-database: Data exploration and predictive modeling (Data Scientist)
EXEC TrainTipPredictionModel
https://docs.microsoft.com/en-us/sql/advanced-analytics/getting-started-with-machine-learning-services
https://blogs.msdn.microsoft.com/microsoft_press/2016/10/19/free-ebook-data-science-with-microsoft-sql-server-2016/
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics with R, Debraj GuhaThakurta
