A fast-paced beginners' course for those who are new to R and want a quick, broad demonstration of R under the hood. We cover data ingestion, manipulating data frames, data summary and exploration, interactive visualization, creating dashboards, predictive modelling, and big data integrations.
2. Why R?
• R is open source, like Python and unlike SAS
• Out of the box, R is a single-machine, in-memory statistical computing engine
– Download from https://www.r-project.org/
• Use an IDE
– RStudio https://www.rstudio.com/
– Revolution Analytics (MSFT)
– Jupyter (IPython)
8. Data Structures and Manipulation
• Another major reason for using R
– Ability to work with data in Data Frames
– Like pandas in Python and data sets in SAS
• Reasons for doing data manipulation (munging)
– Feature extraction
– ETL
– Data cleansing
– Pivots, stack/unstack, aggregate, groupby, reshape
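A minimal sketch of the munging steps listed above, using only base R. The `orders` data frame and its column names are made up for illustration; they are not from the course material.

```r
# Hypothetical toy data: one row per customer order (values are made up).
orders <- data.frame(
  customer = c("a", "a", "b", "b", "c"),
  quarter  = c("Q1", "Q2", "Q1", "Q2", "Q1"),
  spend    = c(10, 20, 5, 15, 30),
  stringsAsFactors = FALSE
)

# Data cleansing: drop rows with missing spend (none here, shown for form).
orders <- orders[!is.na(orders$spend), ]

# Feature extraction: derive a new column from an existing one.
orders$big_order <- orders$spend >= 15

# Group-by aggregate: total spend per customer.
totals <- aggregate(spend ~ customer, data = orders, FUN = sum)

# Pivot (long to wide): one row per customer, one column per quarter.
wide <- reshape(
  orders[, c("customer", "quarter", "spend")],
  idvar = "customer", timevar = "quarter", direction = "wide"
)

print(totals)
print(wide)
```

Customer "c" has no Q2 order, so the wide table holds `NA` in that cell; that is how base R's `reshape()` marks missing combinations.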
9. Set Theory
• SQL joins and their results
• merge(), sqldf in R
– http://www.r-bloggers.com/manipulating-data-frames-using-sqldf-a-brief-overview/
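A short sketch of how SQL-style joins map onto base R's `merge()`. The two data frames are invented for illustration. The sqldf call appears only as a comment, since it assumes the sqldf package is installed.

```r
# Hypothetical tables sharing an "id" key (values are made up).
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(id = c(2, 3, 4), spend = c(20, 30, 40))

# INNER JOIN: only ids present in both tables.
inner <- merge(customers, orders, by = "id")

# LEFT JOIN: every customer, with NA spend where no order exists.
left <- merge(customers, orders, by = "id", all.x = TRUE)

# FULL OUTER JOIN: every id from either table.
full <- merge(customers, orders, by = "id", all = TRUE)

# With the sqldf package installed, the inner join could instead be written as:
#   library(sqldf)
#   sqldf("SELECT c.id, c.name, o.spend
#          FROM customers c JOIN orders o ON c.id = o.id")

print(inner)
print(left)
print(full)
```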
10. Summary and Exploration
• Powerful summary functions for
programmatically quantifying datasets
• Functions include:
– summary(), hist(), levels(), aggregate()
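The four functions named above can be tried on the `iris` data set that ships with R. This is an illustrative sketch only; the choice of data set and columns is an assumption, not something the course specifies.

```r
# iris is a built-in data frame: 150 flowers, 4 measurements, 1 factor column.

# Five-number summary plus the mean for a numeric column.
print(summary(iris$Sepal.Length))

# Categories of a factor column.
print(levels(iris$Species))

# Histogram bin counts; plot = FALSE computes the bins without drawing.
h <- hist(iris$Sepal.Length, plot = FALSE)
print(h$counts)

# Group-wise statistic: mean sepal length per species.
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(means)
```

Note that R is case sensitive, so `summary()` works while `Summary()` does not.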
11. Interactive Visualization and Dashboarding
• Shiny from RStudio
• Like Tableau
– Local and server options
• Much more customizable, more coding, no GUI or click to edit
• But you can bring in powerful libraries to build web apps comparatively fast
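To give a feel for the "more coding, no GUI" trade-off, here is a minimal sketch of a Shiny app. It assumes the shiny package is installed (the guard skips the UI otherwise). The input name `bins` and the use of the built-in `faithful` data set are illustrative choices.

```r
# Server logic: redraws the histogram whenever the slider value changes.
server <- function(input, output) {
  output$hist <- shiny::renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruptions", xlab = "Duration (minutes)")
  })
}

# Only build the UI if shiny is available on this machine.
if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Histogram demo"),
    shiny::sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
    shiny::plotOutput("hist")
  )
  app <- shiny::shinyApp(ui = ui, server = server)
  # In an interactive session, shiny::runApp(app) serves the app locally.
}
```

The whole app is plain R code: the layout, the input widget, and the plot are all function calls, which is what makes it customizable but also why there is no click-to-edit.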
12. Predictive Modeling & Forecasting
• Examples
– Customer segmentation
• Unsupervised classification (clustering)
– Marketing mix models
• Explain the coefficients
– Attribution modeling
• Supervised time series of events
– Multivariate testing
• A/B tests with statistical significance, ANOVA
– Lead scoring
• P2B (propensity-to-buy) models, topic of interest, expected spend
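Two of the examples above can be sketched with functions from base R's stats package: customer segmentation with `kmeans()` and an A/B significance test with `prop.test()`. All data here is synthetic and the numbers are invented for illustration. These are stand-in techniques, not necessarily the ones used in the course.

```r
set.seed(42)  # reproducible synthetic data

# Customer segmentation: two made-up features with three planted groups.
customers <- data.frame(
  visits = c(rnorm(50, 2, 0.5), rnorm(50, 10, 1), rnorm(50, 20, 1)),
  spend  = c(rnorm(50, 20, 5), rnorm(50, 100, 10), rnorm(50, 300, 20))
)

# Scale the features so neither dominates the distance, then cluster.
segments <- kmeans(scale(customers), centers = 3, nstart = 10)
print(table(segments$cluster))

# A/B test: hypothetical conversions out of 1000 visitors per variant.
ab <- prop.test(x = c(120, 150), n = c(1000, 1000))
print(ab$p.value)  # small values suggest the conversion rates differ
```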
13. 5 Libraries for Machine Learning
Allowing the machine to capture complexity:
1. gbm [Gradient Boosting Machine]
2. randomForest [Random Forest]
3. e1071 [Support Vector Machines]
Taking advantage of high-cardinality categorical or text data:
4. glmnet [Lasso and Elastic-Net Regularized Generalized Linear Models]
5. tau [Text Analysis Utilities]
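As an illustration of how one of these libraries is typically called, the sketch below fits a random forest on the built-in `iris` data. It assumes the randomForest package is installed (the guard skips the fit otherwise). The train/test split and `ntree = 200` are arbitrary choices for the example.

```r
set.seed(1)  # reproducible split

# Hold out 50 of the 150 rows for testing.
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Predict Species from the four measurement columns.
  fit  <- randomForest::randomForest(Species ~ ., data = train, ntree = 200)
  pred <- predict(fit, newdata = test)

  # Share of held-out rows classified correctly.
  accuracy <- mean(pred == test$Species)
  print(accuracy)
}
```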
14. Big Data Integration
• A single laptop is often sufficient
– Millions of rows on a 32 GB i7 laptop
• Scale using a larger server
– Often sufficient but has limitations (100s of GB)
• Clustered compute engine
– Algorithm choice affects performance
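One rough way to judge whether a data set fits on a single machine is to measure a sample with `object.size()` and extrapolate. The sketch below assumes five numeric columns and a target of ten million rows, both made-up figures. Real workloads also need working memory beyond the raw data size.

```r
# Build a sample data frame: 100,000 rows of 5 numeric (8-byte) columns.
n <- 1e5
sample_df <- data.frame(matrix(rnorm(n * 5), ncol = 5))

# Measure the in-memory footprint of the sample.
bytes <- as.numeric(object.size(sample_df))
bytes_per_row <- bytes / n

# Extrapolate to a hypothetical 10 million rows, in GiB.
est_gb_10m <- bytes_per_row * 1e7 / 1024^3
print(bytes_per_row)
print(est_gb_10m)
```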
15. RServer
• For datasets that don't fit in memory, or for convenience, there is a server option
– A shared compute engine
– Shares resources
– Think 100+ GB of RAM
16. Big Data Integration - Frameworks
• H2O.ai
• SparkR
• Revolution Analytics
• In-database processing
– Applying a lead score or segmentation model in real time
– Spark, Teradata, Vertica
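One possible way to apply a lead score outside R is to fit the model in R and export its coefficients, so the database only has to evaluate a simple formula. This is an assumption about how in-database scoring might be set up, not a description of the course's method. The sketch uses synthetic data and a hypothetical `visits` feature, and checks that the hand-written formula matches `predict()`.

```r
set.seed(7)  # reproducible synthetic leads

# Made-up training data: one feature and a binary "bought" outcome.
leads <- data.frame(visits = rnorm(200))
leads$bought <- rbinom(200, 1, plogis(0.5 + 1.2 * leads$visits))

# Fit a logistic regression lead-scoring model.
model <- glm(bought ~ visits, data = leads, family = binomial)

# Score new leads the usual way, inside R.
new_leads <- data.frame(visits = c(-1, 0, 2))
scores_r <- predict(model, newdata = new_leads, type = "response")

# The same score written as a plain formula on the coefficients,
# which is the kind of expression a database could evaluate per row.
b <- coef(model)
scores_manual <- 1 / (1 + exp(-(b[["(Intercept)"]] + b[["visits"]] * new_leads$visits)))

print(scores_r)
print(scores_manual)
```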
18. Get Alton’s FREE Reports!
Go to http://frontanalysis.com/bigdatameetup/
Complete the survey including your email
I’ll email you the two reports:
1. Anonymized Summary of the Survey
2. LinkedIn Job Suggestions for a Utah Data Scientist