Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A Step-by-Step Approach for Acceleration and Innovation
2. IBM Netezza with Revolution Analytics Revolution Confidential
High-performance, in-database analytics platform for Big Data
– Massively parallel processing delivers 10-100x performance improvements
– Run analytics in-database and eliminate data movement
– Scalable architecture fosters experimentation
Innovation with Advanced Analytics
– Analytic modeling with the most current statistical methods and 2,500+ open source packages
Enterprise-ready advanced analytics software, services & support
– Security, IDE, training, professional services
– Web Services stack enables integration with front-end presentation layer
© 2012 IBM Corporation
4. What is R?
Download the White Paper "R is Hot": bit.ly/r-is-hot
Data analysis software
A programming language
– Development platform designed by and for statisticians
– Object-oriented: vector, matrix, model, …
– Built-in libraries of algorithms
An environment
– Huge library of algorithms for data access, data manipulation, analysis
and graphics
An open-source software project
– Free, open, and active
A community
– Thousands of contributors, 2 million users
– Resources and help in every domain
5.
Most advanced statistical analysis software available
Half the cost of commercial alternatives
The professor who invented analytic software for the experts now wants to take it to the masses
2M+ Users
2,500+ Applications
– Statistics, Predictive Analytics, Data Mining, Visualization
– Finance, Life Sciences, Manufacturing, Retail, Telecom, Social Media, Government
– Power, Productivity, Enterprise Readiness
6. Revolution R Enterprise has the Open-Source R Engine at the core
2,500 community packages and growing exponentially
– Multi-Threaded Math Libraries
– Technology Partners
– Web Services API
– Big Data Analysis
– Parallel Tools
– Technical Support
– Revolution Productivity Environment
– Build Assurance
– Open Source R Packages
– R Engine Language Libraries
9. In-Database Paradigms for using R
In-database Scoring
– Family of apply functions which score analytic models by using data parallelism
– Underlying truism is that there is a fact that can be applied across all data
– Examples: customer lifetime value, credit score, affinity, good stock/bad stock
Big Data Analytics
– Family of parallelized, in-database analytics that have R wrappers and work on entire data set
– Underlying truism exists across all data
– Examples: clustering of all data to determine groupings; models that are applied across a whole data set (decision trees); data transformation (variable selection, correlation)
Grouped by Row (tapply)
– Data and Task Parallelism
• Data flow technique to apply analytics to naturally occurring groups of data using non-parallelized analytics
– Underlying relationship in data is by a group
– Examples: forecasting by store, stock symbol, etc.; building a model for each customer, product, etc.
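The grouped paradigm (one non-parallelized analysis per naturally occurring group) can be illustrated locally in base R. This is only a sketch of the idea on an ordinary data.frame, not the in-database implementation: the `sales` data, its column names, and the per-store linear trend are all hypothetical.

```r
# Hypothetical sales data: two stores, five weeks each (made up for illustration)
sales <- data.frame(
  store = rep(c("A", "B"), each = 5),
  week  = rep(1:5, times = 2),
  units = c(10, 12, 14, 16, 18,   50, 45, 40, 35, 30)
)

# Split the rows by group, then fit one independent model per group.
# Each group's fit does not depend on any other group, which is what
# makes this pattern easy to distribute across nodes.
models <- lapply(split(sales, sales$store),
                 function(g) lm(units ~ week, data = g))

# Forecast week 6 for every store from its own model
forecast <- sapply(models, function(m) {
  unname(predict(m, newdata = data.frame(week = 6)))
})

print(forecast)  # roughly 20 for store A and 25 for store B
```

On the appliance the same per-group function would be shipped to the data instead of pulling the data into the R session.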
11. Open Source R Package Support
2500+ community packages
Horizontal
• Bayesian
• Cluster
• Distributions
• Graphics
• Graphical Models
• Machine Learning
• Multivariate
• Natural Language Processing
• Optimization
• Robust Statistical Metrics
• Spatial
• Survival Analysis
• Time Series
Vertical
• Econometrics
• Experimental Design
• Computational Physics
• Clinical Trials
• Environmetrics
• Finance
• Genetics
• Medical Imaging
• Pharmacokinetics
• Phylogenetics
• Psychometrics
• Social Sciences
12. Using Revolution R Enterprise with IBM Netezza
Business Intelligence, Excel or Third-Party Application
– HTTP to the RevoDeployR Server (Web Services Interface for R)
Revolution R Enterprise - Workstation and Revolution R Enterprise - Server
– RODBC & nzODBC to the IBM Netezza Analytics Host
– R Packages integrate and push analytics processing in-database
IBM Netezza Analytics Host
– Five S-Blades, each running IBM Netezza Analytics
13. Deploying Revolution R Enterprise to IBM Netezza
• Remote terminal connection to Host
• Create your R Script
• Compile and Register your R Script as an AE (UDAP)
• Execute SQL that will invoke the registered AE
• Go back to the Revolution R Client to retrieve results and continue additional analysis
[Diagram: IBM Netezza Analytics Host with five IBM Netezza Analytics S-Blades]
14. Revolution R Enterprise Client Configuration
Revolution R Enterprise
– Productivity Environment
R Package Dependencies
– RODBC
– caTools
– tree
– bitops
– e1071
– rgl
– ca
– MASS
– XML
Netezza ODBC Drivers
‘nz’ R Packages
– nzA, nzR, nzMatrix
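Before configuring a client, it can help to check which of the listed R package dependencies are already present. The snippet below is a small base R sketch of such a check; it only inspects the local library and installs nothing. The ‘nz’ packages are left out on the assumption that they ship with the Netezza client software rather than through a public repository.

```r
# R package dependencies as listed on the slide
deps <- c("RODBC", "caTools", "tree", "bitops", "e1071",
          "rgl", "ca", "MASS", "XML")

# Compare against the packages installed in the local library paths
installed <- deps %in% rownames(installed.packages())
names(installed) <- deps

missing <- deps[!installed]
if (length(missing) > 0) {
  message("Missing packages: ", paste(missing, collapse = ", "))
} else {
  message("All listed dependencies are installed.")
}
```

Anything reported as missing could then be installed with `install.packages()` before loading the ‘nz’ packages.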
15. IBM Netezza In-Database Analytics from Revolution R
nzR Package
– Encapsulate database and expose “R”-like constructs
– R data.frame = database table
– Apply an R function to a row of data or grouped rows of data
nzA Package
– Entry point to the nzAnalytics
– Explicitly parallelized algorithms that run in database
nzMatrix Package
– Encapsulation of Matrices and operations in Database
– nz.matrix construct in R to access matrices in the database
– R operations on nz.matrix translate to matrix stored procedure operations
16. nzR Package
Basic Functions
Database Connection    nzConnect, nzConnectDSN
SQL Execution          nzQuery, nzScalarQuery, nzDeleteTable
Data Management        as.nz.data.frame, nz.data.frame
Apply an R function    nzApply, nzTApply, nzGroupedApply
R Package Management   nzInstallPackages, nzIsPackageInstalled
Sample Code
#load packages
library(nzr)
#connect to a database via ODBC
nzConnect("admin", "xyz", "127.0.0.1", "iclasstest")
#load the iris table
nzdf <- nz.data.frame("iris")
#run a nzTApply against the nz dataframe
fun <- function(x) max(x[,1])
nzTApply(nzdf, nzdf[,5], fun)
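To see what the sample's grouped call computes without an appliance, the same function can be applied to R's built-in `iris` data.frame with base R. This is a local analogue only: it assumes the database table `iris` has the same columns as the built-in data set, and the shape of the value returned by `nzTApply` may differ from the named vector produced here.

```r
# Same function as in the sample code: maximum of the first column
fun <- function(x) max(x[, 1])

# Group the rows by the fifth column (Species), then apply fun to each group
groups <- split(iris, iris[, 5])
result <- sapply(groups, fun)

print(result)
# setosa: 5.8, versicolor: 7.0, virginica: 7.9 (maximum Sepal.Length per species)
```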
17. nzA Package
Data Manipulation
Moments nz.moments
Quantiles nz.quantile, nz.quartile
Outlier Detection nz.outliers
Frequency Table nz.bitable
Histogram nz.hist
Pearson's Correlation nz.corr
Spearman's Correlation nz.spearman.corr, nz.spearman.corr.s
Covariance nz.cov, nz.cov.matrix
Mutual Information nz.mutualinfo
Chi-Square Test nzChisq.test, nz.chisq.test
t-Test t.ls.test, t.me.test, t.pmd.test, t.umd.test
Mann-Whitney-Wilcoxon Test nz.mww.test
Wilcoxon Test nz.wilcoxon.test
Canonical Correlation nz.canonical.corr
One-Way ANOVA nzAnova, nz.anova.CRD.test, nz.anova.RBD.test
Principal Component Analysis nzPCA
Tree-Shaped Bayesian Networks nz.TBNet Apply, nz.TBNet Grow, nz.BigBNControl, nz.TBNet1g2p, nz.TBNet1g, nz.TBNet2g
18. nzA Package
Data Transformations
Discretization nz.efdisc, nz.emdisc, nz.ewdisc
Standardization and Normalization nz.std.norm
Data Imputation nz.impute.data
Model Diagnostics
Misclassification Error nz.cerror
Confusion Matrix nz.acc, nz.CMATRIX STATS
Mean Absolute Error nz.mae
Mean Square Error nz.mse
Relative Absolute Error nz.rae
Percentage Split nz.percentage.split
Cross-Validation nz.cross.validation
19. nzA Package
Classification
Naive Bayes          nzNaiveBayes, nz.naivebayes, nz.predict.naivebayes
Decision Trees       nzDecTree, nz.dectree, nz.grow.dectree, nz.print.dectree, nz.prune.dectree, nz.predict.dectree
Nearest Neighbors    nz.knn
Regression
Linear Regression    nzLm
Regression Trees     nzRegTree, nz.regtree, nz.grow.regtree, nz.print.regtree, nz.predict.regtree
Clustering
K-Means Clustering   nzKMeans, nz.kmeans, nz.predict.kmeans
Divisive Clustering  nz.divcluster, nz.predict.divcluster
Associative Rule Mining
FP-Growth            nz.fpgrowth, nz.prepare.fpgrowth
20. nzMatrix Package
Data Manipulation
Coerce or point to a nz.matrix as.nz.matrix, as.nz.matrix.matrix, nz.matrix
Combine Matrices nzCBind, nzRBind
Create Matrices From Tables nzCreateMatrixFromTable, nzCreateTableFromMatrix
Create Special Matrices nzIdentityMatrix, nzNormalMatrix, nzOnesMatrix,
nzRandomMatrix, nzVecToDiag
Decomposition nzSVD, svd, nzEigen
Delete Matrices nzDeleteMatrix, nzDeleteMatrixByName
Dimensions dim, NCOL, ncol, NROW, nrow
Mathematical Functions abs, add, aubtr, ceiling, div, exp, floor, ln, log10, mod,
mult, nzPowerMatrix, pow, rounding, sqrt, trunc
Matrix Engine Initialization nzMatrixEngineInitialization
Matrix Info is.nz.matrix, isSparse, nzExistMatrix, nzExistMatrixByName,
nzGetValidMatrixName
Operators *, +, -, <, ==, >, nzKronecker, nzPMax, nzPMin, nzSetValue,
[, scale, t
Printing Matrices print.nz.matrix
Solve nzInv, nzSolve, nzSolveLLS
Sparse Matrices isSparse, nzSparse2matrix
Summaries nzAll, nzAny, nzMax, nzMin, nzSsq, nzSum, nzTr
21. Demonstration
Using Revolution R
with IBM Netezza
March 1, 2012 © 2012 IBM Corporation
23. Use Case – Credit Risk
We have a dataset of individuals and their credit risk, stored on the Netezza appliance.
The goal is to model whether someone is “approvable” for a loan.
This use case follows a modeling process (though condensed) from start to finish.
I will discuss each part, and at the end there will be a demo of the code.
24. Modeling Exercise
1. Learning more about the data
2. Prepare the data for modeling
3. Fit models to the data
4. Model Performance
25. 1. Learning more about the data
Connect to the IBM Netezza appliance
Summarize the data
Visualize the data
[Figure: two frequency plots. Left panel "Continuous Variable": x axis from 0 to 25. Right panel "Discrete Variable": High School Diploma, Bachelors Degree, Masters Degree, Professional Degree, PhD. Frequency axis from 0 to 300 in both panels.]
26. 2. Prepare the data for modeling
Split the data into 70/30 Training/Test sets
Transform some variables
Discretize numeric variables for later use
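The split and discretization steps can be sketched in base R on a local data.frame. The `credit` data below is a randomly generated, hypothetical stand-in for the real table (which lives on the appliance), and the column names, seed, and choice of five equal-width bins are assumptions made for illustration; the demo itself presumably uses the in-database equivalents such as nz.percentage.split and nz.ewdisc.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the credit data
credit <- data.frame(
  years_employed = runif(200, min = 0, max = 25),
  approvable     = sample(c("no", "yes"), 200, replace = TRUE)
)

# 70/30 split: draw 70% of the row indices for training, keep the rest for test
train_idx <- sample(nrow(credit), size = floor(0.7 * nrow(credit)))
train <- credit[train_idx, ]
test  <- credit[-train_idx, ]

# Equal-width discretization into 5 bins over the known range 0 to 25.
# The same breaks are used for both sets so the bins mean the same thing.
breaks <- seq(0, 25, length.out = 6)
train$years_bin <- cut(train$years_employed, breaks = breaks, include.lowest = TRUE)
test$years_bin  <- cut(test$years_employed,  breaks = breaks, include.lowest = TRUE)

table(train$years_bin)  # bin counts in the training set
```

Fixing the breaks once and reusing them keeps a model trained on the binned training data applicable to the test data.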
27. 3. Fit models to the data
Build two different models to predict if an
individual is “approvable”
Decision Tree
Naïve Bayes
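To make the Naïve Bayes step concrete, here is a minimal hand-written classifier for discretized (categorical) predictors in base R. It is a teaching sketch under stated assumptions, not the nzNaiveBayes implementation: the toy `train` data, the helper names `fit_nb` and `predict_nb`, and the Laplace smoothing constant are all hypothetical.

```r
# Hypothetical toy data: discretized predictors and an "approvable" label
train <- data.frame(
  income     = c("low", "low", "high", "high", "high", "low", "high", "low"),
  education  = c("hs", "ba", "ba", "ms", "hs", "hs", "ms", "ba"),
  approvable = c("no", "no", "yes", "yes", "yes", "no", "yes", "no"),
  stringsAsFactors = TRUE
)

# Estimate class priors and per-predictor conditional probabilities
fit_nb <- function(data, label, laplace = 1) {
  y <- data[[label]]
  predictors <- setdiff(names(data), label)
  list(
    label_levels = levels(y),
    prior = setNames(as.numeric(table(y)), levels(y)) / length(y),
    cond = lapply(data[predictors], function(x) {
      counts <- table(y, x) + laplace   # smoothed counts, classes in rows
      counts / rowSums(counts)          # P(value | class)
    })
  )
}

# Score each class with log prior + sum of log conditionals, pick the best
predict_nb <- function(model, newdata) {
  scores <- sapply(model$label_levels, function(cls) {
    logp <- rep(log(model$prior[[cls]]), nrow(newdata))
    for (v in names(model$cond)) {
      logp <- logp + log(model$cond[[v]][cls, as.character(newdata[[v]])])
    }
    logp
  })
  scores <- matrix(scores, nrow = nrow(newdata),
                   dimnames = list(NULL, model$label_levels))
  best <- max.col(scores, ties.method = "first")
  factor(model$label_levels[best], levels = model$label_levels)
}

model <- fit_nb(train, "approvable")
new_applicants <- data.frame(income = c("high", "low"), education = c("ms", "hs"))
predict_nb(model, new_applicants)  # "yes" for the first applicant, "no" for the second
```

The sketch assumes every predictor value in the new data was seen during training; a production implementation would need to handle unseen levels.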
28. 4. Model Performance
Examine confusion matrices to determine:
Training performance
Test performance
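A confusion matrix and the accuracy derived from it can be computed with base R's `table`. The `actual` and `predicted` vectors below are made up for illustration; in the demo they would come from the fitted models and the training or test set.

```r
# Hypothetical labels for eight individuals
actual    <- factor(c("yes", "no", "yes", "yes", "no", "no", "yes", "no"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "yes", "no", "yes", "yes", "no"),
                    levels = c("no", "yes"))

# Rows are the true classes, columns are the predicted classes
cm <- table(actual = actual, predicted = predicted)
print(cm)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
misclassification <- 1 - accuracy
accuracy  # 0.75 for these labels
```

Computing the matrix separately for the training and test sets shows whether a model performs noticeably worse on data it has not seen.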
29. Demo
30. Summary
Familiar environment for R Developers
– World-class productivity tools
– Enterprise class service, support and integration
Execution of analytics in-database
– Analytic computing distributed across Netezza nodes and run
in a massively parallel manner
– Each Netezza node gets a data slice and analytics are pushed
down from the Host to the individual nodes
Capabilities
– R Code executed on Netezza nodes in row-by-row fashion or
on groups of rows
– Enables access to explicitly parallelized algorithms running on
entire data set
– Large-scale parallel matrix operations on database tables
Performance
– 10-100x Performance improvements
31. Contact Us
Bill Zanine
Business Solutions Executive, Analytics Solutions
IBM Netezza
wzanine@us.ibm.com
Derek Norton
Solutions Executive
Revolution Analytics
derek.norton@revolutionanalytics.com
www.revolutionanalytics.com +1 (650) 646 9545 Twitter: @RevolutionR