Analytics Beyond RAM Capacity
using R
Dr. Alex Palamides
Athens Big Data Meetup
September 19th 2017
Page | 2
• R is important not only because it is a language tailored directly to the needs of predictive analytics, but also because it is used by a large and growing community
• So R is much more than a language; it is:
• a language
• an ecosystem
• a community
• a vast array of techniques that data scientists can draw on when solving new problems
Athens Big Data – Modelling beyond RAM capacity with R
Introduction to R
Page | 3
Introduction to R
R ranked 5th in the IEEE Spectrum 2016 ranking, up from 9th two years earlier
Page | 4
Introduction to R
• Open source statistical programming language based upon “S”
• R is one of the most popular data science tools (along with Python)
• The base functionality can be expanded using “packages”
• The usage of R has increased dramatically in recent years:
• Popular with educational and research communities
• Known to be used at many of the leading tech firms (Airbnb, Facebook, Google, Twitter, Uber, etc.)
• R Consortium support from Google, IBM, Microsoft, Oracle, etc.
• Microsoft's purchase of Revolution Analytics (R Open, R Server, SQL Server, AzureML)
• RStudio is a popular IDE for R; R Tools for Visual Studio is another option
Page | 5
Introduction to R – Data handling / visualization
• Common file formats are easily read into R
– library(data.table), fread(…) for CSV or text files (as an alternative to read.csv(…))
– library(readxl) for Excel
– library(haven) for SAS datasets
• Access and submit SQL queries using ODBC and library(dplyr)
• Data is usually stored in a data.frame object
• Two main packages are used for processing data in R
– library(dplyr) uses action verbs to act upon data frames
– library(data.table) is faster and more powerful; however, the syntax is more challenging to learn
• library(ggplot2) is a very popular graphics package for R
Page | 6
Introduction to R – Model Building
• glm(…) to build generalized linear models; commonly used for logistic regression
• step(…) to execute stepwise regression
• lm(…) for linear regression
• rpart(…) for CART trees
• randomForest(…) for random forests
• knn(…) for k-nearest neighbours
• nnet(…) for neural networks
• rcorr.cens(…) for the Gini coefficient
• caret::R2 for R squared and model tuning
• And so on…
• In general, there are multiple ways to create models, thanks to the open source community
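As a minimal sketch of a few of the functions listed above, the following uses only base R and the built-in mtcars dataset. The dataset and the formulas are assumptions chosen for illustration; they are not taken from the talk.

```r
# Illustrative only: mtcars and the formulas below are assumptions,
# not examples from the talk.
data(mtcars)

# Linear regression: fuel consumption as a function of weight and horsepower
lin_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Logistic regression via glm(): probability of a manual gearbox (am = 1)
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Backward stepwise selection with step(), starting from a larger model
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
step_fit <- step(full_fit, direction = "backward", trace = 0)

print(coef(lin_fit))
print(summary(logit_fit)$coefficients)
print(formula(step_fit))
```

All three calls hold the full dataset in memory, which is the limitation the rest of the talk addresses.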
Page | 7
R – Limitations and Solutions
• In-Memory Operation
• Lack of Parallelism
• Expensive Data Movement & Duplication
A couple of scalable R solutions:
• Choose R packages with big data support on single machines
– The “bigmemory” project
– “ff” and related packages
• Scale from single machines to distributed computing
– SparkR
– sparklyr
– RevoScaleR (Microsoft R Server)
and more!
Page | 8
R – Limitations and Solutions
MSR is a family of R-based products that run both independently and inside a SQL Server database and other platforms. They give users multiple ways to take data in the organization, apply predictive analytics to develop learning and insight, and deploy the results directly as applications that the business can use to take direct action.
Page | 9
Core Idea
• Microsoft R Server (MSR), by contrast, uses the capabilities of the RevoScaleR package to follow a different approach: datasets are stored on disk and computations are performed on chunks of data, so the data is inherently distributed
• In MSR, the most common data operations (manipulation and analysis) are supported by counterpart functions, in addition to (indirect) support for open-source R algorithms
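The chunked, on-disk idea can be illustrated without RevoScaleR. The sketch below uses only base R: it writes a CSV to a temporary file and then computes the mean of one column by reading a fixed number of rows at a time, so only one chunk is held in memory at once. The function name chunked_mean, the column names, and the chunk size are hypothetical choices for illustration; this is not a description of how RevoScaleR is implemented internally.

```r
# Hypothetical helper: mean of one numeric column, read chunk by chunk.
chunked_mean <- function(path, column, chunk_rows = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once; scan() strips the quotes written by write.csv()
  header <- scan(text = readLines(con, n = 1), what = "", sep = ",", quiet = TRUE)
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)   # next block of raw lines
    if (length(lines) == 0) break             # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])     # update the running result
    n <- n + nrow(chunk)
  }
  total / n
}

# Usage on a small synthetic file (2,500 rows, read in blocks of 1,000)
tmp <- tempfile(fileext = ".csv")
set.seed(1)
write.csv(data.frame(x = rnorm(2500), y = runif(2500)), tmp, row.names = FALSE)
result <- chunked_mean(tmp, "x", chunk_rows = 1000)
print(result)
```

Only the running sum and the row count survive between chunks, which is why memory use does not grow with the size of the file.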
Page | 10
Intro description
• 100% compatible with open source R
– Any code/package that works today with R will work in R Server.
• Ability to parallelize any R function
– Ideal for parameter sweeps, simulation, scoring.
• Wide range of scalable and distributed “rx”-prefixed functions in the “RevoScaleR” package.
– Transformations: rxDataStep()
– Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()…
– Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()…
– Parallelism: rxSetComputeContext()
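The "parameter sweep" use case mentioned above can be sketched with the parallel package that ships with open-source R. This is a stand-in for illustration and does not use RevoScaleR's own mechanism. The function fit_one, the grid of polynomial degrees, and the choice of two local workers are all hypothetical.

```r
library(parallel)

# Hypothetical task: fit a polynomial of the given degree, return its R squared.
# Functions and data are fully qualified so they resolve on the worker processes.
fit_one <- function(degree) {
  fit <- stats::lm(mpg ~ stats::poly(wt, degree), data = datasets::mtcars)
  summary(fit)$r.squared
}

degrees <- 1:4
cl <- makeCluster(2)                      # two local worker processes (assumption)
r2 <- tryCatch(
  parSapply(cl, degrees, fit_one),        # each degree is evaluated on a worker
  finally = stopCluster(cl)               # always release the workers
)
names(r2) <- paste0("degree_", degrees)
print(r2)
```

Each parameter value is independent of the others, which is what makes sweeps, simulations and scoring straightforward to spread across workers.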
Page | 11
Microsoft R Open (the Interpreter)
Increased performance and scalability through parallelization and streaming
Page | 12
Microsoft R Server Components
Page | 13
R Open – Traditional Connection to a DB
Page | 14
Scale through Parallelization
In R Server, data can still be pulled from the source in the usual way.
Most of these limitations are overcome through the increased performance of parallelized computations.
Algorithms run faster because they are written in C++.
Operations take place one block at a time, so there is no need to hold all the data in RAM; results are updated chunk by chunk.
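Least squares is one way to see how a model can be updated "chunk by chunk": the cross-products X'X and X'y are sums over rows, so they can be accumulated one block at a time and solved once at the end. The sketch below does this in base R on a data frame split into blocks. The function chunked_lm and the block size are hypothetical, and the code is not a description of RevoScaleR's C++ implementation.

```r
# Hypothetical helper: ordinary least squares from per-block cross-products.
# Assumes numeric predictors; factor levels would need to be fixed up front.
chunked_lm <- function(data, formula, block_rows = 10) {
  xtx <- NULL
  xty <- NULL
  for (s in seq(1, nrow(data), by = block_rows)) {
    rows <- s:min(s + block_rows - 1, nrow(data))
    mf <- model.frame(formula, data[rows, , drop = FALSE])
    X <- model.matrix(formula, mf)
    y <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(X)          # X'X for the first block
      xty <- crossprod(X, y)       # X'y for the first block
    } else {
      xtx <- xtx + crossprod(X)    # update the running totals
      xty <- xty + crossprod(X, y)
    }
  }
  drop(solve(xtx, xty))            # solve the normal equations once
}

beta <- chunked_lm(mtcars, mpg ~ wt + hp, block_rows = 10)
print(beta)
```

The coefficients should agree with lm(mpg ~ wt + hp, data = mtcars) up to numerical tolerance, because the accumulated sums equal the cross-products of the full design matrix.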
Page | 15
Scale through Parallelization
Page | 16
Microsoft R Server
Page | 17
Functions rebuilt to ensure high performance.
Page | 18
Microsoft ML package
State-of-the-art machine learning algorithms:
rxFastTrees: An implementation of FastRank, an efficient implementation of the MART gradient boosting algorithm.
rxFastForest: A random forest and quantile regression forest implementation using rxFastTrees.
rxLogisticRegression: Logistic regression using L-BFGS.
rxOneClassSvm: One-class support vector machines.
rxNeuralNet: Binary, multi-class, and regression neural net.
rxFastLinear: Stochastic dual coordinate ascent optimization for linear binary classification and regression.
rxPredict.mlModel: Scores using a model created by one of the machine learning algorithms.
Helper functions for arguments: loss functions (expLoss, logLoss, et al.) and kernel functions (linearKernel, rbfKernel, et al.)
Text tools:
featurizeText: language detection, tokenization, stopword removal, text normalization and feature generation
categorical transforms with a dictionary, feature selection from specified variables, and others
Page | 19
RevoScaleR Performance Benchmarking
When it comes to scaling, performance is critical. An example of the performance improvements available to users of Microsoft R:
• The blue bar depicts the performance and the data-size capacity of the open-source R generalized linear model, which is commonly used for logistic regression.
• The red bar shows what Microsoft achieves with an equivalent algorithm that is massively parallelized, remotely executable and, most importantly, rewritten in C++ to maximize performance.
• As you can see, there is about a forty-to-one difference in performance.
• There is essentially no limit to scalability: runtime increases almost linearly with the data size.
Page | 20
Coding Example (Local Execution)
Page | 21
Remote Execution
Page | 22
Remote Execution Architecture
• Faster Computation
• Larger Data Sets
• Fewer Security concerns
Page | 23
Coding Example (Remote Execution)
Page | 24
Getting Started (0)
Microsoft R is a collection of packages, interpreters, and infrastructure for
developing and deploying R-based machine learning and data science solutions
on a range of platforms
• Microsoft R Server is the flagship product and supports very large workloads
in the enterprise.
• Microsoft R Client is a free workstation version. It includes the same R Server
functionality, but for local workloads.
• Microsoft R Open is Microsoft's distribution of open source R, without the
proprietary packages and infrastructure of our other products. This R
distribution is included in both Microsoft R Client and R Server.
• Student / Developer Free Version available
• Supported on several platforms:
• R Server for Hadoop
• R Server for Linux
• R Server for Windows
• R Server for Teradata
• R Server for Azure
• Embedded in SQL Server 2017 as R Services
Page | 25
Getting Started (1)
• Data import and exploration:
• mysource <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv") # point to the source file
• airXdfData <- rxImport(inData = mysource) # import the file as a data frame; be careful of memory limits
• airXdfData <- rxImport(inData = mysource, outFile = "c:/Users/Temp/airExample.xdf") # import as XDF, i.e., store the file on the hard drive
• An .xdf file is a binary file format native to Microsoft R, used for persisting data on disk. An .xdf file is column-based, one column per variable, which is optimal for the variable orientation of data used in statistics and predictive analytics. It includes precomputed metadata that is immediately available with no additional processing.
• Examine object metadata:
• rxGetInfo(airXdfData, getVarInfo = TRUE)
• For example, rxGetInfo(airXdfData, getVarInfo = TRUE) results in:
Variable information:
Var 1: ArrDelay 702 factor levels: 6 -8 -2 1 -14 ... 451 430 597 513 432
Var 2: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 23.9833)
Var 3: DayOfWeek 7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
• Summarize data:
• rxSummary(~ArrDelay + CRSDepTime + DayOfWeek, data = airXdfData) reports:
• Number of valid observations
• Statistics (mean, StdDev, etc.) for numerical variables
• Counts per level for categorical variables
• rxSummary(formula = ~ArrDelay:DayOfWeek, data = airXdfData) # statistics by category
Page | 26
Getting Started (2)
• Data transformations - one function to rule them all:
• All transformations are performed via the rxDataStep command
• airXdfData <- rxDataStep(inData = airXdfData, outFile = "c:/Users/Temp/airExample.xdf", transforms = list(VeryLate = (ArrDelay > 120 | is.na(ArrDelay))), overwrite = TRUE)
• A full RevoScaleR data step consists of the following steps:
• Read in the data a block (200,000 rows) at a time.
• For each block, pass the ArrDelay data to the R interpreter, which processes the transformation to create VeryLate.
• Write the data out to the dataset a block at a time. The argument overwrite = TRUE allows us to overwrite the data file.
• From XDF to data frame:
• myData <- rxDataStep(inData = airXdfData, rowSelection = ArrDelay > 240 & ArrDelay <= 300, varsToKeep = c("ArrDelay", "DayOfWeek"))
• Subsetting:
• rxReadXdf(airXdfData, numRows = 10, startRow = 100000)
• Visualizations:
• rxHistogram(~ArrDelay, data = myData)
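The block-wise data step on this slide (read a block, transform it in R, write it out) can be mimicked in base R with CSV files. The sketch below is an illustration under stated assumptions: block_transform_csv and add_very_late are hypothetical names, CSV stands in for the XDF format, and the block size is deliberately tiny so that several blocks are processed.

```r
# Hypothetical helper: apply transform_fn to a CSV file one block at a time.
block_transform_csv <- function(in_path, out_path, transform_fn, block_rows = 200000) {
  con <- file(in_path, open = "r")
  on.exit(close(con))
  header <- scan(text = readLines(con, n = 1), what = "", sep = ",", quiet = TRUE)
  first <- TRUE
  repeat {
    lines <- readLines(con, n = block_rows)        # 1. read a block
    if (length(lines) == 0) break
    block <- read.csv(text = lines, header = FALSE, col.names = header)
    block <- transform_fn(block)                   # 2. transform it in R
    write.table(block, out_path, sep = ",", row.names = FALSE,
                col.names = first, append = !first)  # 3. write the block out
    first <- FALSE
  }
  invisible(out_path)
}

# The same rule as on the slide: very late if delayed > 120 minutes or missing
add_very_late <- function(block) {
  block$VeryLate <- block$ArrDelay > 120 | is.na(block$ArrDelay)
  block
}

in_file <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(ArrDelay = c(5, 130, NA, 90, 300, 121, 0)),
          in_file, row.names = FALSE)
block_transform_csv(in_file, out_file, add_very_late, block_rows = 3)
result <- read.csv(out_file)
print(result)
```

Because each block is written before the next one is read, peak memory depends on the block size and not on the number of rows in the file.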
Page | 27
Getting Started (3)
• Modelling:
• rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airXdfData) # simple linear regression
• arrDelayLm3 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime), data = airXdfData, cube = TRUE) # interaction and on-the-fly factor conversion
• myTab <- rxCrossTabs(ArrDelay ~ DayOfWeek, data = airLateDS) # tabulation by group
• logitObj <- rxLogit(Late ~ DepHour + Night, data = airExtraDS) # logistic regression
• predictDS <- rxPredict(modelObject = logitObj, data = airExtraDS, outData = airExtraDS) # predictions from model objects
• Modelling using a compute cluster:
• myCluster <- RxSparkConnect(nameNode = "my-name-service-server", port = 8020)
• rxSetComputeContext(myCluster)
• With your compute context set to the cluster, all of the RevoScaleR data analysis functions automatically distribute computations across the nodes of the cluster
• delayCarrierLocDist <- rxLinMod(ArrDelay ~ UniqueCarrier + Origin + Dest, data = dataFile, cube = TRUE, blocksPerRead = 30) # regression runs on the cluster
• rxSetComputeContext("local") # reset compute context back to the local machine
• And many more, applicable to various operation types: RxXdfData, rxFactors, rxSplit, rxMerge, rxRocCurve, rxDTree, rxKmeans, rxDForest, rxNaiveBayes, RxHadoopMR, RxSpark, RxInTeradata, RxInSqlServer, RxLocalParallel, rxExec
Page | 28
Thank you!
Analytics Beyond RAM Capacity using R
Athens Big Data Meetup, September 19th 2017
Contact Details
Dr. Alex Palamides
www.linkedin.com/in/alex-palamides
palamid@gmail.com

Weitere ähnliche Inhalte

Was ist angesagt?

How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
HostedbyConfluent
 
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
HostedbyConfluent
 
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
confluent
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Flink Forward
 

Was ist angesagt? (20)

Confluent building a real-time streaming platform using kafka streams and k...
Confluent   building a real-time streaming platform using kafka streams and k...Confluent   building a real-time streaming platform using kafka streams and k...
Confluent building a real-time streaming platform using kafka streams and k...
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
 
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...
 
Data Pipelines Made Simple with Apache Kafka
Data Pipelines Made Simple with Apache KafkaData Pipelines Made Simple with Apache Kafka
Data Pipelines Made Simple with Apache Kafka
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafka
 
Creating Connector to Bridge the Worlds of Kafka and gRPC at Wework (Anoop Di...
Creating Connector to Bridge the Worlds of Kafka and gRPC at Wework (Anoop Di...Creating Connector to Bridge the Worlds of Kafka and gRPC at Wework (Anoop Di...
Creating Connector to Bridge the Worlds of Kafka and gRPC at Wework (Anoop Di...
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VR
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VRKafka Summit NYC 2017 Hanging Out with Your Past Self in VR
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VR
 
Failing to Cross the Streams – Lessons Learned the Hard Way | Philip Schmitt,...
Failing to Cross the Streams – Lessons Learned the Hard Way | Philip Schmitt,...Failing to Cross the Streams – Lessons Learned the Hard Way | Philip Schmitt,...
Failing to Cross the Streams – Lessons Learned the Hard Way | Philip Schmitt,...
 
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
Kafka as your Data Lake - is it Feasible? (Guido Schmutz, Trivadis) Kafka Sum...
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
 
Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka
 
Kafka Summit SF 2017 - Database Streaming at WePay
Kafka Summit SF 2017 - Database Streaming at WePayKafka Summit SF 2017 - Database Streaming at WePay
Kafka Summit SF 2017 - Database Streaming at WePay
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
The Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scaleThe Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scale
 
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
 
Confluent Kafka and KSQL: Streaming Data Pipelines Made Easy
Confluent Kafka and KSQL: Streaming Data Pipelines Made EasyConfluent Kafka and KSQL: Streaming Data Pipelines Made Easy
Confluent Kafka and KSQL: Streaming Data Pipelines Made Easy
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Stores
 

Ähnlich wie Analytics Beyond RAM Capacity using R

Michal Marušan: Scalable R
Michal Marušan: Scalable RMichal Marušan: Scalable R
Michal Marušan: Scalable R
GapData Institute
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 

Ähnlich wie Analytics Beyond RAM Capacity using R (20)

TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
Michal Marušan: Scalable R
Michal Marušan: Scalable RMichal Marušan: Scalable R
Michal Marušan: Scalable R
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computing
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & R
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and Revolution
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 

Kürzlich hochgeladen

Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 

Kürzlich hochgeladen (20)

Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Analytics Beyond RAM Capacity using R

  • 1. Analytics Beyond RAM Capacity using R. Dr. Alex Palamides, Athens Big Data Meetup, September 19th 2017
  • 2. Introduction to R (Athens Big Data – Modelling beyond RAM capacity with R) • R is important not merely because it is a language tailored directly to the needs of predictive analytics, but because it is used by a huge and growing community • R is therefore much more than a language; it is: • a language • an ecosystem • a community • a vast array of techniques that data scientists can draw on when solving new problems
  • 3. Introduction to R: R ranks #5 in the IEEE Spectrum 2016 ranking, whereas two years earlier it ranked 9th
  • 4. Introduction to R • Open source statistical programming language based upon “S” • R is one of the most popular data science tools (along with Python) • The base functionality can be expanded using “packages” • The usage of R has increased dramatically over recent years: • Popular with educational and research communities • Known to be used at many of the leading tech firms (Airbnb, Facebook, Google, Twitter, Uber, etc.) • R Consortium support from Google, IBM, Microsoft, Oracle, etc. • Microsoft's purchase of Revolution Analytics (R Open, R Server, SQL Server, AzureML) • RStudio is a popular IDE for R (see also R Tools for Visual Studio)
  • 5. Introduction to R – Data handling / visualization • Common file formats are easily read into R – library(data.table), fread(…) for CSV or text files (as an alternative to read.csv(…)) – library(readxl) for Excel – library(haven) for SAS datasets • Access and submit SQL queries using ODBC and library(dplyr) • Data is usually stored in a data.frame object • Two main packages are used for processing data in R – library(dplyr) uses action verbs to act upon data frames – library(data.table) is faster and more powerful; however, the syntax is more challenging to learn • library(ggplot2) is a very popular graphics package for R
  • 6. Introduction to R – Model Building • glm(…) to build generalized linear models; commonly used for logistic regression • step(…) to execute stepwise regression • lm(…) for linear regression • rpart(…) for CART trees • randomForest(…) for random forests • knn(…) for k-nearest neighbours • nnet(…) for neural networks • rcorr.cens(…) for the Gini coefficient • caret::R2 for R squared, and caret for model tuning • And so on… • In general there are multiple ways to create models, thanks to the open source community
  • 7. R – Limitations and Solutions • In-memory operation • Lack of parallelism • Expensive data movement and duplication. A couple of scalable R solutions: • Choose R packages with big data support on single machines • The “bigmemory” project • “ff” and related packages • Scale from single machines to distributed computing • SparkR • sparklyr • RevoScaleR (Microsoft R Server) and more!
  • 8. R – Limitations and Solutions: Microsoft R Server (MSR) is a family of R-based products that live both independently and inside a SQL Server database and other platforms. They give users multiple ways to take data in the organization, apply predictive analytics to develop learning and insight, and deploy the results directly as applications on which the business can take direct action
  • 9. Core Idea • Microsoft R Server (MSR), by using the capabilities of the RevoScaleR package, follows a different approach: datasets are stored on disk and computations are performed on chunks of data, so the data is inherently distributed • In MSR the most common data operations (manipulation and analysis) are supported by counterpart functions, in addition to (indirect) support for open-source R algorithms
  • 10. Intro description • 100% compatible with open source R: any code/package that works today with R will work in R Server • Ability to parallelize any R function: ideal for parameter sweeps, simulation, scoring • Wide range of scalable and distributed “rx”-prefixed functions in the “RevoScaleR” package – Transformations: rxDataStep() – Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()… – Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()… – Parallelism: rxSetComputeContext()
  • 11. Microsoft R Open (the interpreter): increased performance and scalability through parallelization and streaming
  • 12. Microsoft R Server Components
  • 13. R Open – Traditional Connection to a DB
  • 14. Scale through Parallelization • In R Server, too, data can be pulled from the source in the typical way, but most of these limitations are overcome: • Increased performance due to parallelized computations • Faster implementations of the algorithms, as they are written in C++ • As operations take place one block at a time, there is no need to place all the data in RAM; results are updated chunk by chunk
  • 15. Scale through Parallelization
  • 16. Microsoft R Server
  • 17. Functions rebuilt to ensure high performance.
  • 18. MicrosoftML package – State-of-the-art machine learning algorithms: • rxFastTrees: an implementation of FastRank, an efficient implementation of the MART gradient boosting algorithm • rxFastForest: a random forest and quantile regression forest implementation using rxFastTrees • rxLogisticRegression: logistic regression using L-BFGS • rxOneClassSvm: one-class support vector machines • rxNeuralNet: binary, multi-class, and regression neural net • rxFastLinear: stochastic dual coordinate ascent optimization for linear binary classification and regression • rxPredict.mlModel: scores using a model created by one of the machine learning algorithms • Helper functions for arguments: loss functions (expLoss, logLoss, etc.) and kernel functions (linearKernel, rbfKernel, etc.) • Text tools: featurizeText (language detection, tokenization, stopword removal, text normalization and feature generation), categorical transforms with a dictionary, feature selection from specified variables, and others
  • 19. RevoScaleR Performance Benchmarking: When it comes to scaling, performance is critical. An example of the performance improvements available to users of Microsoft R: • The blue bar depicts the performance rate and data size capacity of the open source R generalized linear model, commonly used as the logistic regression algorithm • The red bar shows the effect Microsoft achieves by creating an equivalent algorithm that is massively parallelized, remotely executable and, most importantly, rewritten in C++ to maximize the performance of the algorithm • As you can see, there is about a forty-to-one difference in the performance rate • Essentially no limit to scalability: runtime increases almost linearly with the data size
  • 20. Coding Example (Local Execution)
  • 21. Remote Execution
  • 22. Remote Execution Architecture • Faster computation • Larger data sets • Fewer security concerns
  • 23. Coding Example (Remote Execution)
  • 24. Getting Started (0): Microsoft R is a collection of packages, interpreters, and infrastructure for developing and deploying R-based machine learning and data science solutions on a range of platforms • Microsoft R Server is the flagship product and supports very large workloads in the enterprise • Microsoft R Client is a free workstation version. It includes the same R Server functionality, but for local workloads • Microsoft R Open is Microsoft's distribution of open source R, without the proprietary packages and infrastructure of the other products. This R distribution is included in both Microsoft R Client and R Server • A free student / developer version is available • Supported on several platforms: • R Server for Hadoop • R Server for Linux • R Server for Windows • R Server for Teradata • R Server for Azure • Embedded in SQL Server 2017 as R Services
  • 25. Getting Started (1)
      • Data import and exploration:
          – mysource <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")  # point to the source file
          – airXdfData <- rxImport(inData = mysource)  # import the file as a data frame; be careful of data size limitations
          – airXdfData <- rxImport(inData = mysource, outFile = "c:/Users/Temp/airExample.xdf")  # import as XDF, i.e. store the file on the hard drive
      • An .xdf file is a binary file format native to Microsoft R, used for persisting data on disk. An .xdf file is column-based, one column per variable, which is optimal for the variable orientation of data used in statistics and predictive analytics. It includes precomputed metadata that is immediately available with no additional processing.
      • Examine object metadata:
          – rxGetInfo(airXdfData, getVarInfo = TRUE)
          – For example, the call above results in: Variable information: Var 1: ArrDelay 702 factor levels: 6 -8 -2 1 -14 ... 451 430 597 513 432 Var 2: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 23.9833) Var 3: DayOfWeek 7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
      • Summarize data:
          – rxSummary(~ArrDelay+CRSDepTime+DayOfWeek, data = airXdfData)
          – Reports the number of valid observations, statistics (mean, StdDev, etc.) for numerical variables, and counts per level for categorical variables
          – rxSummary(formula = ~ArrDelay:DayOfWeek, data = airXdfData)  # statistics by category
  • 26. Getting Started (2)
      • Data transformations – one function to rule them all:
          – All transformations are performed via the rxDataStep command
          – airXdfData <- rxDataStep(inData = airXdfData, outFile = "c:/Users/Temp/airExample.xdf", transforms = list(VeryLate = (ArrDelay > 120 | is.na(ArrDelay))), overwrite = TRUE)
      • A full RevoScaleR data step consists of the following steps:
          – Read in the data a block (200,000 rows) at a time
          – For each block, pass the ArrDelay data to the R interpreter, which processes the transformation to create VeryLate
          – Write the data out to the dataset a block at a time. The argument overwrite = TRUE allows us to overwrite the data file
      • From XDF to data frame:
          – myData <- rxDataStep(inData = airXdfData, rowSelection = ArrDelay > 240 & ArrDelay <= 300, varsToKeep = c("ArrDelay", "DayOfWeek"))
      • Subsetting:
          – rxReadXdf(airXdfData, numRows = 10, startRow = 100000)
      • Visualizations:
          – rxHistogram(~ArrDelay, data = myData)
  • 27. Getting Started (3)
      • Modelling:
          – rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airXdfData)  # simple linear regression
          – arrDelayLm3 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime), data = airXdfData, cube = TRUE)  # interaction and on-the-fly factor conversion
          – myTab <- rxCrossTabs(ArrDelay ~ DayOfWeek, data = airLateDS)  # tabulation by group
          – logitObj <- rxLogit(Late ~ DepHour + Night, data = airExtraDS)  # logistic regression
          – predictDS <- rxPredict(modelObject = logitObj, data = airExtraDS, outData = airExtraDS)  # predictions from model objects
      • Modelling using a compute cluster:
          – myCluster <- RxSparkConnect(nameNode = "my-name-service-server", port = 8020)
          – rxSetComputeContext(myCluster)
          – With your compute context set to the cluster, all of the RevoScaleR data analysis functions automatically distribute computations across the nodes of the cluster
          – delayCarrierLocDist <- rxLinMod(ArrDelay ~ UniqueCarrier+Origin+Dest, data = dataFile, cube = TRUE, blocksPerRead = 30)  # regression runs on the cluster
          – rxSetComputeContext("local")  # reset the compute context back to the local machine
      • And many more, applicable to various types of operation: RxXdfData, rxFactors, rxSplit, rxMerge, rxRocCurve, rxDTree, rxKmeans, rxDForest, rxNaiveBayes, RxHadoopMR, RxSpark, RxInTeradata, RxInSqlServer, RxLocalParallel, rxExec
  • 28. Thank you! Analytics Beyond RAM Capacity using R, Athens Big Data Meetup, September 19th 2017. Contact details: Dr. Alex Palamides, www.linkedin.com/in/alex-palamides, palamid@gmail.com
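
The core idea from slides 9 and 14 (data stays on disk, computation proceeds one block at a time, and results are updated chunk by chunk) can be illustrated in plain base R. This is only a sketch of the general technique, not a description of how RevoScaleR is implemented: the helper name `chunked_mean_sd`, the `block_rows` value and the simulated `ArrDelay` column are assumptions made for the example.

```r
# Hypothetical helper (not part of RevoScaleR): streaming mean and standard
# deviation of one numeric column of a CSV file, reading `block_rows` rows at
# a time so that only one block is held in memory.
chunked_mean_sd <- function(path, column, block_rows = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # The first line holds the column names; scan() strips surrounding quotes.
  header <- scan(text = readLines(con, n = 1L), what = "", sep = ",", quiet = TRUE)
  n <- 0; s <- 0; ss <- 0              # running count, sum and sum of squares
  repeat {
    lines <- readLines(con, n = block_rows)
    if (length(lines) == 0L) break     # end of file reached
    block <- read.csv(text = lines, header = FALSE, col.names = header)
    x <- block[[column]]
    x <- x[!is.na(x)]                  # keep only valid observations
    n <- n + length(x)
    s <- s + sum(x)
    ss <- ss + sum(x^2)
  }
  mean_x <- s / n
  # Simple sum-of-squares formula; adequate for a sketch, although it is not
  # the most numerically robust choice for very large or badly scaled data.
  sd_x <- sqrt((ss - n * mean_x^2) / (n - 1))
  list(n = n, mean = mean_x, sd = sd_x)
}

# Simulated data written to a temporary file (illustrative only).
set.seed(1)
demo <- data.frame(ArrDelay = rnorm(5000, mean = 10, sd = 30))
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)

res <- chunked_mean_sd(tmp, "ArrDelay", block_rows = 700L)
print(res)
```

Because only the running totals survive between blocks, memory use depends on `block_rows` rather than on the size of the file.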

Editor's Notes

  1. DevelopR is the IDE. DeployR is essentially a web services gateway that lets users expose web services through which BI tools and custom applications can invoke R scripts, run R models and retrieve results without even knowing that they are calling R, so R does not have to be installed on platforms where it is not needed. ConnectR gives access to a variety of data sources, such as SAS or SPSS files, and the ability to save files in a format called XDF, which provides a high degree of compression as well as fast data retrieval when needed. DistributedR also plays a supporting role for ScaleR: it is a normalization layer that provides an abstract interface on top of which the ScaleR algorithms can operate. The core is the set of algorithms provided by the ScaleR layer, redesigned and written in C++ to provide parallelized computation, to load data one block at a time (combating the memory constraints referred to earlier), and to provide the ability to execute algorithms on remote systems
  2. Pull data into R, load it into memory and run the analysis. Let's see an example of a simple data pulling and analysis script. The data is extracted from the DB, which brings the typical problems: high data movement time, RAM constraints in the analysis environment, and duplication of data between the analyzed and DB versions
  3. In R Server, too, data can be pulled from the source in the typical way, but most of these limitations are overcome: performance increases thanks to parallelized computations, and the algorithms are faster because they are written in C++. As the operation takes place one block at a time, there is no need to place all the data in RAM. Results are updated chunk by chunk
  4. Thanks to RevoScaleR, the script remains simple: just a single call. Internally, however, the DistributedR component is used to identify how many cores and threads are available and to allocate portions of the work to each of these resources. Data is analyzed in chunks, using the high-performance XDF format; the name stems from "external data format". An XDF file is typically 5 times smaller than a CSV containing the same dataset. No parsing is required, so the retrieval time is reduced significantly
  5. Built on open source R, it includes a number of enhancements and adaptations that provide the ability to scale up to enterprise-class level. Run R at speed on platforms like Hadoop. Build scripts on one platform, then run and operationalize them on another: write locally, run in the cloud. R Server comes preloaded and prebuilt in the cloud. To operationalize means to set up something based on R, e.g. a scoring algorithm, and expose its interfaces via web services to be consumed by all types of BI tools and applications
  6. Instead of pulling the data and doing the work in-house, it is possible to push the work to the data repository by using the remote execution capabilities
  7. Instead of running the linear regression locally, the parameters are packaged into a request object, which is passed to the remote system. The remote system starts the master process, and only the results are returned to the script. So: no data movement, and platform-independent work
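
Notes 3 and 4 describe analyzing data in chunks, with results updated chunk by chunk. For a linear regression, one common way to do this is to accumulate the cross-products X'X and X'y block by block and solve the normal equations once at the end. The sketch below shows that technique in base R under stated assumptions: it does not claim to reproduce the internals of rxLinMod, the helper name `chunked_lm` is hypothetical, the data is simulated, and for brevity the blocks are slices of an in-memory data frame rather than blocks read from disk.

```r
# Hypothetical helper (not a RevoScaleR function): least-squares coefficients
# computed from per-block cross-products, so that no step needs the full
# design matrix at once.
chunked_lm <- function(formula, data, block_rows = 1000L) {
  xtx <- NULL
  xty <- NULL
  for (start in seq(1L, nrow(data), by = block_rows)) {
    rows <- start:min(start + block_rows - 1L, nrow(data))
    mf <- model.frame(formula, data[rows, , drop = FALSE])
    X <- model.matrix(attr(mf, "terms"), mf)  # design matrix for this block
    y <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(X)                     # X'X for the first block
      xty <- crossprod(X, y)                  # X'y for the first block
    } else {
      xtx <- xtx + crossprod(X)               # update the running totals
      xty <- xty + crossprod(X, y)
    }
  }
  drop(solve(xtx, xty))                       # solve (X'X) beta = X'y
}

# Simulated data (illustrative only). Factor levels are fixed up front so that
# every block produces the same design-matrix columns.
set.seed(42)
days <- c("Mon", "Tue", "Wed")
demo <- data.frame(
  CRSDepTime = runif(3000, 0, 24),
  DayOfWeek  = factor(sample(days, 3000, replace = TRUE), levels = days)
)
demo$ArrDelay <- 5 + 0.8 * demo$CRSDepTime +
  3 * (demo$DayOfWeek == "Wed") + rnorm(3000, sd = 10)

beta <- chunked_lm(ArrDelay ~ CRSDepTime + DayOfWeek, demo, block_rows = 500L)
print(beta)
```

The per-block cross-products are small (one row and column per coefficient), which is also why this kind of computation distributes well: each worker can process its own blocks and only the totals need to be combined.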