Big Data Analysis Starts with R
1. Revolution Analytics
The Big Data Analytics Revolution
Starts with R
December 20, 2011
2. In Today’s Webinar:
About Revolution Analytics
Getting Value with Advanced Analytics
Implementing The Advanced Analytics Stack
Resources and Further Reading
3. Most advanced statistical analysis software available
Half the cost of commercial alternatives
The professor who invented analytic software for the experts now wants to take it to the masses
2M+ Users
4,000+ Applications
Power
Productivity
Enterprise Readiness
Statistics
Predictive Analytics
Data Mining
Visualization
Finance
Life Sciences
Manufacturing
Retail
Telecom
Social Media
Government
4. What is R ?
Data analysis software
An open-source
software project
A programming language
A community
An environment
5. What’s the Difference Between R and Revolution R Enterprise?
Revolution R is 100% R and More®
Multi-Threaded Math Libraries
Web-Based GUI
Web Services API
Big Data Analysis
Parallel Tools
Technical Support
IDE / Developer GUI
Build Assurance
4,000+ Community Packages
R Engine, Language Libraries
7. Extracting Value with Advanced Analytics
Missing the potential value of the data that is
being collected
Need more than counts and averages
Advanced Analytics with Big Data
Predict the Future
Understand Risk and Uncertainty
Embrace Complexity
Identify the Unusual
Think Big
8. R: A Unique Platform for Extracting Value from Data
Data Exploration and Visualization
• R is superior at exploring data to find unexpected trends and relationships, finding the best predictive models, and identifying critical “outliers”, such as clusters of customers who are particularly profitable (or unprofitable!).
Data Science
• Google, LinkedIn and Facebook rely on R and the skills of data scientists who are accustomed to hacking together large data sets from disparate sources, visualizing and exploring data to identify novel modeling techniques, and combining the results of several modeling strategies to optimize predictive power.
Modeling Innovation
• Other commercial programs push users through a pre-programmed procedure and discourage modeling innovation. R was created as a 4GL with the needs of modern data scientists in mind, with an interactive language that promotes data exploration, data visualization, and flexible data modeling.
Talent
• R is creating a massive amount of talent because it is now the dominant tool of choice at universities.
10. The Advanced Analytics Stack
Deployment / Consumption
Advanced Analytics
ETL
Data / Infrastructure
“Open Analytics Stack” White Paper: bit.ly/lC43Kw
11. Best Practices for Implementing an Advanced Analytics Stack for Big Data
Limit sampling
Reduce data movement and replication
Bring the analytics as close as possible to
the data
Optimize computation speed – parallel
algorithms
12. Big Data Computations
Computations are data intensive
To be effective, must rely on data parallelism
Data is distributed across compute nodes
Same task is run in parallel on each of the data partitions
Examples of distributed computing frameworks that
support data parallelism
Traditional file based analytics using on-premise clusters
Hadoop and MapReduce
In-Database Analytics using parallel hardware
architectures
13. Revolution R Enterprise: Big Data Statistics in R
www.revolutionanalytics.com/bigdata
Every US airline
departure and arrival,
1987-2008
File: AirlineData87to08.xdf
Rows: 123.5 million
Variables: 29
Size on disk: 13.2 GB
arrDelayLm2 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime), data = "AirlineData87to08.xdf", cube = TRUE)
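To give a feel for what a formula such as DayOfWeek:F(CRSDepTime) estimates, here is a minimal base-R sketch on a small synthetic data set (not the airline file). It relies on the fact that a model made up only of an interaction of two factors, fitted without an intercept, reduces to one mean per cell. The data frame, the DepHour column, and all values are made up for illustration, and base lm() stands in for rxLinMod, which is not called here.

```r
# Synthetic stand-in for the airline data (illustrative values only).
set.seed(1)
n <- 2000
flights <- data.frame(
  DayOfWeek = factor(sample(c("Mon", "Tue", "Wed"), n, replace = TRUE)),
  DepHour   = factor(sample(6:9, n, replace = TRUE)),  # hypothetical hour bucket
  ArrDelay  = rnorm(n, mean = 10, sd = 30)
)

# One coefficient per (day, hour) cell: with no intercept and no main
# effects, each coefficient is that cell's mean arrival delay.
fit <- lm(ArrDelay ~ 0 + DayOfWeek:DepHour, data = flights)

# The same cell means computed directly, for comparison.
cell_means <- with(flights, tapply(ArrDelay, list(DayOfWeek, DepHour), mean))

# TRUE when the regression coefficients and the direct cell means agree.
cells_match <- isTRUE(all.equal(sort(unname(coef(fit))),
                                sort(as.vector(cell_means))))
```

Because the result is just a table of cell statistics, a computation of this shape can be built from per-chunk sums and counts, which is what makes it a natural fit for data sets too large to hold in memory.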
14. RevoScaleR – Distributed Computing
• Portions of the data source are made available to each compute node
• RevoScaleR on the master node assigns a task to each compute node
• Each compute node independently processes its data, and returns its intermediate results back to the master node
• The master node aggregates all of the intermediate results from each compute node and produces the final result
Diagram: one Master Node (RevoScaleR) and four Compute Nodes (RevoScaleR), each with its own Data Partition
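The partition, process, and aggregate pattern described on this slide can be sketched in base R for a linear regression. This is only an illustration under stated assumptions, not RevoScaleR's actual implementation: the data are synthetic, the four "compute nodes" are simulated with lapply() on one machine, and node_task is a hypothetical helper name. Each partition contributes only small cross-product matrices, which the "master" sums before solving the normal equations.

```r
# Synthetic data (illustrative only): y depends linearly on x1 and x2.
set.seed(42)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

# "Data partitions": split the rows into 4 chunks, one per pretend compute node.
partitions <- split(dat, rep(1:4, length.out = n))

# Task run on each node: intermediate cross-products for its chunk only.
node_task <- function(chunk) {
  X <- cbind(Intercept = 1, x1 = chunk$x1, x2 = chunk$x2)
  list(xtx = crossprod(X), xty = crossprod(X, chunk$y))
}
intermediate <- lapply(partitions, node_task)

# "Master node": sum the intermediate results and solve the normal equations.
xtx <- Reduce(`+`, lapply(intermediate, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(intermediate, `[[`, "xty"))
beta <- drop(solve(xtx, xty))

# Check against a single-pass fit on the full data.
full_fit <- coef(lm(y ~ x1 + x2, data = dat))
same_fit <- isTRUE(all.equal(unname(beta), unname(full_fit)))
```

The intermediate results are 3-by-3 and 3-by-1 matrices regardless of how many rows each partition holds, so only a small amount of data has to travel back to the master.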
15. R and Hadoop
Capabilities delivered as individual R packages
rhdfs - R and HDFS
rhbase - R and HBASE
rmr - R and MapReduce
Downloads available from GitHub
Diagram labels: R Client, Job Tracker, Task Node, Map or Reduce, R, HDFS, HBASE, Thrift, rhdfs, rhbase, rmr
17. Deployment with Revolution R Enterprise
End User: Desktop Applications (e.g. Excel), Business Intelligence (e.g. QlikView), Interactive Web Applications
Application Developer: Client libraries (JavaScript, Java, .NET)
HTTP/HTTPS – JSON/XML
RevoDeployR Web Services: Session Management, Authentication, Data/Script Management, Administration
Admin
R Programmer
R (four instances shown in the diagram)
18. Three final thoughts
Now enterprise-ready, R offers innovation
and flexibility needed to meet analytics
challenges in a changing world
R-enabled advanced analytics are key to
unlocking value in big data
Revolution Analytics optimizes R to take
advantage of multiple data management
paradigms and emerging best practices
19. Resources
Slides / Replay: bit.ly/r-big-data
“Open Analytics Stack” White Paper: bit.ly/lC43Kw
McKinsey Report on Big Data: bit.ly/jWyrFM
Conway, Data Science Intelligence: bit.ly/myMwak
“Big Analytics” White Paper by Norman H. Nie: bit.ly/biganalytics
Revolution R Enterprise: bit.ly/Enterprise-R
Questions: david.champagne@revolutionanalytics.com
20. Thank you.
The leading commercial provider of software and support for the popular
open source R statistics language.
www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR