2. Integrating R and Hadoop: Part of Revolution Analytics' Big Analytics Strategy. Contact us at info@revolutionanalytics.com
3. Outline
- Introduction to Revolution Analytics
- Opportunity and Challenges of Big Analytics
- Revolution Analytics' Support of Integration between R and Hadoop
- Contact Info
6. 2,500+ Applications: Finance, Statistics, Life Sciences, Predictive Analytics, Manufacturing, Retail, Data Mining, Telecom, Social Media, Visualization, Government
8. Big Analytics, Big Advantages
Big Analytics could be:
- Simple algorithms running on "Big Data"
- Compute-intensive algorithms running on either "Big Data" or small data sets
- Advanced analytic routines for data visualization or statistical analysis
9. Extracting Value with Big Analytics
Big Analytics' advantages:
- Predict the Future
- Understand Risk and Uncertainty
- Embrace Complexity
- Identify the Unusual
- Think Big
10. Big Analytics Challenges
- Computations are data intensive (i.e. require large amounts of data)
- To be effective, they must rely on data parallelism:
  - Data is distributed across compute nodes
  - The same task is run in parallel on each of the data partitions
- Examples of distributed computing frameworks that support data parallelism:
  - Traditional file-based analytics using on-premise clusters
  - Hadoop and MapReduce
  - In-database analytics using parallel hardware architectures
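The data-parallel pattern the slide describes (partition the data, run the same task on every partition, combine the partial results) can be sketched in a few lines of plain R. This is purely illustrative and not any Revolution Analytics API: lapply stands in for the cluster's compute nodes.

```r
# Data parallelism in miniature: the data is split into partitions, the same
# task runs on every partition, and the partial results are combined.
# A framework like MapReduce runs this same loop with the partitions living
# on different nodes of the cluster.
x <- as.numeric(1:1000000)
partitions <- split(x, cut(seq_along(x), 4, labels = FALSE))  # "distribute" into 4 parts
partial    <- lapply(partitions, sum)                         # same task on each partition
total      <- Reduce(`+`, partial)                            # combine the partial results
total == sum(x)  # TRUE
```

The key property is that each partition's task is independent, so the work scales out simply by adding nodes.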
11. Key Objectives for Big Analytics Deployments
Best performance is achieved when these Big Analytics challenges are overcome:
- Avoid sampling / aggregation
- Reduce data movement and replication
- Bring the analytics as close as possible to the data
- Optimize computation speed
Revolution Analytics' support for R and Hadoop helps overcome these challenges.
12. Revolution Analytics' RevoConnectRs for Hadoop
- RevoHDFS provides connectivity from R to HDFS, and RevoHBase provides connectivity from R to HBase
  - Allows an R programmer to manipulate Hadoop data stores directly from HDFS and HBase
- RevoHStream allows MapReduce jobs to be developed in R and executed as Hadoop Streaming jobs
  - Gives R programmers the ability to write MapReduce jobs in R using Hadoop Streaming
14. Hadoop Streaming: package for executing MapReduce jobs from R.
[Architecture diagram: the R Client submits a job to the Job Tracker; Task Trackers on each Task Node run the R Map and Reduce tasks.]
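Hadoop Streaming itself is language-agnostic: it runs any executable as the map or reduce task, feeding input records on stdin and reading tab-separated key/value lines from stdout. The sketch below uses hypothetical helper names (not the RevoHStream API) and expresses a word-count map and reduce as R functions over character vectors, so the data flow can be followed without a cluster.

```r
# In a real Streaming job these two functions would be wrapped in R scripts
# reading stdin and writing stdout; Hadoop sorts the map output by key
# before it reaches the reducer.

map_words <- function(lines) {
  # emit one "word<TAB>1" record for every word in the input lines
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]
  paste(words, 1, sep = "\t")
}

reduce_counts <- function(kv_lines) {
  # map output arrives grouped/sorted by key; total the counts per key
  parts <- strsplit(kv_lines, "\t")
  keys  <- vapply(parts, `[`, "", 1)
  vals  <- as.integer(vapply(parts, `[`, "", 2))
  counts <- tapply(vals, keys, sum)
  paste(names(counts), counts, sep = "\t")
}

out <- reduce_counts(map_words(c("big data big analytics")))
# "analytics\t1" "big\t2" "data\t1"
```

Because both tasks only read standard input and write standard output, the same R code runs unchanged whether launched locally or by a Task Tracker.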
15. RevoHDFS: R package for working with HDFS
- Connect to and browse HDFS
- Read/Write/Delete/Copy/Rename files
Examples:
- Read an HDFS text file into a data frame
- Serialize a data frame to HDFS
- Stream lines from an HDFS text file that can be used with biglm or bigglm
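The slides do not show RevoHDFS's function signatures, so as a stand-in the sketch below streams a local text file in fixed-size chunks. This chunked access pattern is what lets incremental fitters such as biglm/bigglm work on files larger than memory; in practice the HDFS connection would replace the local one.

```r
# Stream a delimited file chunk by chunk instead of loading it whole.
# The temp file here is a stand-in for an HDFS text file.
tmp <- tempfile()
writeLines(paste(1:10, (1:10)^2, sep = ","), tmp)

con <- file(tmp, open = "r")
total_rows <- 0
repeat {
  lines <- readLines(con, n = 4)          # one chunk of lines
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE,
                    col.names = c("x", "y"))
  # here each chunk could be fed to update(fit, chunk) for a biglm model
  total_rows <- total_rows + nrow(chunk)
}
close(con)
total_rows  # 10
```

Only one chunk is ever resident in memory, so the file size is bounded by disk (or HDFS), not RAM.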
16. RevoHBase: R package for working with HBase
- Connect to and browse HBase
- Get rows/columns of an HBase table
- Write data to an HBase table
- Create/Delete an HBase table
Examples:
- Create a data frame in R from a collection of rows/columns from HBase
- Update an HBase table with values from a data frame
17. RevoHStream
RevoHStream is an R package capable of performing the following types of analysis using Hadoop Streaming:
- Simulations (Monte Carlo and other stochastic analysis)
- R "apply" family of operations (tapply, lapply, ...)
- Binning, quantiles, summaries and crosstabs for input to displays (ggplot, lattice)
- Data transformations
- Data mining
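Simulations are a natural fit for this model because each map task can run an independent batch of replications, with the reduce step merely combining totals. A local sketch of the pattern (no Hadoop involved; sapply plays the role of the parallel map tasks, and the function names are introduced here for illustration), estimating pi by Monte Carlo:

```r
# Each "map task" runs an independent, seed-keyed batch of simulations;
# the "reduce" step just totals the hits across tasks.
run_task <- function(n, seed) {
  set.seed(seed)                  # distinct seed per task keeps batches independent
  x <- runif(n); y <- runif(n)
  sum(x * x + y * y <= 1)         # darts landing inside the quarter circle
}

n_tasks <- 8; n_per_task <- 50000
hits   <- sapply(seq_len(n_tasks), function(s) run_task(n_per_task, s))
pi_hat <- 4 * sum(hits) / (n_tasks * n_per_task)
```

Since the batches never communicate, this workload scales almost perfectly as tasks are spread over more nodes.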
18. Example MapReduce Algorithm: Logistic Regression

## create the test set as follows:
## rhwrite(lapply(1:100, function(i) {
##   eps = rnorm(1, sd = 10)
##   keyval(i, list(x = c(i, i + eps), y = 2 * (eps > 0) - 1))
## }), "/tmp/logreg")
## run as: rhLogisticRegression("/tmp/logreg", 10, 2, 0.05)
## note: the maximum likelihood solution diverges to (-Inf, Inf) for a
## separable dataset such as the above
rhLogisticRegression = function(input, iterations, dims, alpha) {
  plane = rep(0, dims)
  g = function(z) 1 / (1 + exp(-z))
  for (i in 1:iterations) {
    gradient = rhread(revoMapReduce(input,
      map = function(k, v)
        keyval(1, v$y * v$x * g(-v$y * (plane %*% v$x))),
      reduce = function(k, vv)
        keyval(k, apply(do.call(rbind, vv), 2, sum)),
      combine = TRUE))
    plane = plane + alpha * gradient[[1]]$val
  }
  plane
}
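For readers checking the math without a cluster, the same gradient-ascent update can be written as a hedged local equivalent: each record contributes y_i * x_i * g(-y_i * (plane . x_i)), and the reduce step is just a sum over records. The name logreg_local is introduced here for illustration and is not part of the slide's API.

```r
# Local equivalent of the MapReduce gradient step above (no Hadoop).
logreg_local <- function(X, y, iterations, alpha) {
  plane <- rep(0, ncol(X))
  g <- function(z) 1 / (1 + exp(-z))
  for (i in seq_len(iterations)) {
    margins  <- as.vector(X %*% plane)
    # per-record contribution y_i * x_i * g(-y_i * margin_i), summed over
    # records; this sum is what the map tasks emit and the reducer adds up
    gradient <- colSums(X * (y * g(-y * margins)))
    plane    <- plane + alpha * gradient
  }
  plane
}

X <- rbind(c(0, 1), c(1, 0))
y <- c(1, -1)                   # label is the sign of x2 - x1
w <- logreg_local(X, y, iterations = 100, alpha = 0.5)
sign(as.vector(X %*% w)) == y   # both TRUE
```

As the slide's comment warns, on separable data like this the weights grow without bound; only the direction of the separating plane stabilizes.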
19. Get more information about Revolution Analytics' Big Analytics Solutions, including R connectors for Hadoop: http://www.revolutionanalytics.com/big-analytics or 1-855-GET-REVO