Data analysis in R for beginners

A fast-paced beginner's course for those who are new to R and want a quick, broad demonstration of R under the hood. We cover data ingestion, manipulating data frames, data summary and exploration, interactive visualization, creating dashboards, predictive modelling, and big data integrations.

  1. Data Analysis in R for Beginners • Alton Alexander, Data Science Consultant
  2. Why R? • R is open source, like Python and unlike SAS • Out of the box, R is a single-machine, in-memory statistical computing engine – Download from https://www.r-project.org/ • Use an IDE – RStudio https://www.rstudio.com/ – Revolution Analytics (MSFT) – Jupyter (IPython)
  3. RStudio: Download and Overview
  4. Essential Learning Resources • A new book for learning R • Q: What have you tried and what works?
  5. Topics • Data ingestion • Manipulation • Summary and exploration • Writing reports • Interactive visualization and dashboarding • Predictive modeling & forecasting • Big data integrations
  6. Demo Options • Data • RStudio
  7. Data ingestion • Load data – read.csv() for flat files – library(RJDBC) for databases over JDBC – library(RODBC) for databases over ODBC
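As a minimal sketch of the flat-file case, the example below writes a small CSV to a temporary file and reads it back with `read.csv()`. The column names and values are made up for illustration. The database routes (RJDBC, RODBC) need a live connection and driver, so they are not shown here.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The columns (id, region, spend) are hypothetical illustration data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,region,spend",
             "1,West,120.5",
             "2,East,87.0",
             "3,West,45.25"), csv_path)

# read.csv() returns a data frame; stringsAsFactors = FALSE keeps
# text columns as character vectors instead of factors.
customers <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column types that R inferred.
str(customers)
```

`str()` is a quick way to confirm that numeric columns were not read in as text.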
  8. Data Structures and Manipulation • Another major reason for using R – Ability to work with data in data frames – Like pandas DataFrames in Python and data sets in SAS • Reasons for doing data manipulation (munging) – Feature extraction – ETL – Data cleansing – Pivots, stack/unstack, aggregate, groupby, reshape
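A few of these munging operations can be sketched in base R alone. The `sales` data frame and its columns are invented for illustration; packages such as dplyr or data.table offer alternatives that are not shown here.

```r
# Hypothetical illustration data: revenue by region and quarter.
sales <- data.frame(
  region  = c("West", "West", "East", "East"),
  quarter = c("Q1", "Q2", "Q1", "Q2"),
  revenue = c(100, 150, 80, 120),
  stringsAsFactors = FALSE
)

# Feature extraction: derive a new logical column from an existing one.
sales$high_value <- sales$revenue > 100

# Group-by aggregate: total revenue per region.
by_region <- aggregate(revenue ~ region, data = sales, FUN = sum)

# Pivot from long to wide: one row per region, one column per quarter.
wide <- reshape(sales[, c("region", "quarter", "revenue")],
                idvar = "region", timevar = "quarter", direction = "wide")

print(by_region)
print(wide)
```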
  9. Set Theory • SQL joins and their results • merge, sqldf in R • http://www.r-bloggers.com/manipulating-data-frames-using-sqldf-a-brief-overview/
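The join types map onto arguments of base R's `merge()`, sketched below with two made-up tables. The sqldf package lets you write the same joins as SQL strings, but it is an extra install and is not used in this sketch.

```r
# Hypothetical illustration data: ids 2 and 3 appear in both tables.
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(id = c(2, 3, 4), total = c(50, 75, 20))

# Inner join: only ids present in both tables.
inner <- merge(customers, orders, by = "id")

# Left join: every customer, with NA where no order exists.
left <- merge(customers, orders, by = "id", all.x = TRUE)

# Full outer join: every id from either table.
full <- merge(customers, orders, by = "id", all = TRUE)

print(inner)
print(left)
print(full)
```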
  10. Summary and Exploration • Powerful summary functions for programmatically quantifying datasets • Functions include: – summary(), hist(), levels(), aggregate()
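Here is a short sketch of these four functions on R's built-in `iris` dataset, chosen only because it ships with R. `hist()` is called with `plot = FALSE` so the example also runs without a graphics device; drop that argument to draw the chart.

```r
# iris ships with base R: 150 flowers, 4 numeric columns, 1 factor.
data(iris)

# Five-number summary plus mean for one numeric column.
print(summary(iris$Sepal.Length))

# Distinct categories of a factor column.
species <- levels(iris$Species)
print(species)

# Mean sepal length per species.
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(species_means)

# Histogram bin counts, computed without drawing.
h <- hist(iris$Sepal.Length, plot = FALSE)
print(h$counts)
```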
  11. Interactive Visualization and Dashboarding • Shiny from RStudio • Like Tableau – Local and server options • Much more customizable, but more coding: no GUI or click-to-edit • But you can bring in powerful libraries to build web apps comparatively fast
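To give a feel for the "more coding, no GUI" trade-off, here is a minimal Shiny app sketch, assuming the shiny package is installed. The input and output ids (`bins`, `hist`) and the choice of the built-in `iris` data are arbitrary illustration choices.

```r
library(shiny)

# UI: a slider that controls the number of histogram bins, plus a plot area.
ui <- fluidPage(
  titlePanel("Sepal length histogram"),
  sliderInput("bins", "Number of bins:", min = 5, max = 30, value = 10),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider value changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(iris$Sepal.Length, breaks = input$bins,
         main = NULL, xlab = "Sepal length")
  })
}

# Build the app object; nothing is served until it is run.
app <- shinyApp(ui = ui, server = server)

# In an interactive session, launch it locally with:
# runApp(app)
```

The same two pieces, `ui` and `server`, are what you would deploy to a server instead of running locally.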
  12. Predictive Modeling & Forecasting • Examples – Customer segmentation (unsupervised classification) – Marketing mix models (explain the coefficients) – Attribution modeling (supervised time series of events) – Multivariate testing (A/B tests with statistical significance, ANOVA) – Lead scoring (P2B models, topic of interest, propensity to buy, expected spend)
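Two of these examples can be sketched with base R's stats functions. The data is synthetic, and the methods are stand-ins chosen for brevity: k-means for segmentation and logistic regression for propensity to buy. The slides do not say which methods were used in the demo.

```r
set.seed(42)

# Synthetic customers forming two loose groups (illustration only).
customers <- data.frame(
  visits = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 8, sd = 0.5)),
  spend  = c(rnorm(50, mean = 20, sd = 5),  rnorm(50, mean = 100, sd = 5))
)

# Customer segmentation: k-means on standardized features.
segments <- kmeans(scale(customers[, c("visits", "spend")]),
                   centers = 2, nstart = 10)
customers$segment <- segments$cluster

# Lead scoring: simulate a purchase flag, then fit a logistic regression.
customers$bought <- rbinom(100, 1, prob = plogis(-3 + 0.6 * customers$visits))
model <- glm(bought ~ visits, data = customers, family = binomial)

# Propensity to buy: predicted probability for each customer.
customers$propensity <- predict(model, type = "response")

# The fitted coefficients are what you would explain to stakeholders.
print(coef(model))
print(table(customers$segment))
```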
  13. 5 Libraries for Machine Learning • Allowing the machine to capture complexity: 1. gbm [Gradient Boosting Machine] 2. randomForest [Random Forest] 3. e1071 [Support Vector Machines] • Taking advantage of high-cardinality categorical or text data: 4. glmnet [Lasso and Elastic-Net Regularized Generalized Linear Models] 5. tau [Text Analysis Utilities]
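As one illustration from this list, the sketch below fits a random forest, assuming the randomForest package is installed. The dataset (built-in `iris`), the 100/50 train/test split, and `ntree = 200` are arbitrary choices for the example.

```r
library(randomForest)

set.seed(1)

# Hold out 50 of the 150 rows for a quick accuracy check.
train_idx <- sample(nrow(iris), 100)

# Fit a classification forest predicting species from the four measurements.
fit <- randomForest(Species ~ ., data = iris[train_idx, ], ntree = 200)

# Predict on the held-out rows and compute the share classified correctly.
pred <- predict(fit, newdata = iris[-train_idx, ])
accuracy <- mean(pred == iris$Species[-train_idx])
print(accuracy)

# Variable importance shows which features the forest relied on.
print(importance(fit))
```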
  14. Big Data Integration • A single laptop is often sufficient – Millions of rows on a 32 GB i7 laptop • Scale using a larger server – Often sufficient but has limitations (100s of GB) • Clustered compute engine – Algorithm choice affects performance
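A rough way to check whether a dataset fits in memory is to measure a sample and extrapolate. The sketch below assumes an all-numeric table with 10 columns, a hypothetical shape. Text and factor columns behave differently, and many operations need working copies on top of the raw size.

```r
# Build a sample table: 100,000 rows by 10 numeric columns (hypothetical shape).
n_rows <- 1e5
n_cols <- 10
df <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))

# Measure its in-memory size in bytes.
size_bytes <- as.numeric(object.size(df))

# Doubles take 8 bytes each, so expect a bit over 80 bytes per row here.
bytes_per_row <- size_bytes / n_rows

# Extrapolate to 100 million rows of the same shape, in GiB.
est_gb_100m <- bytes_per_row * 1e8 / 1024^3

print(bytes_per_row)
print(est_gb_100m)
```

Comparing the estimate with available RAM gives a first indication of when to move from a laptop to a larger server.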
  15. RServer • For datasets that don’t fit in memory, or for convenience, there is a server option – A shared compute engine – Shares resources – Think 100+ GB of RAM
  16. Big Data Integration - Frameworks • H2O.ai • SparkR • Revolution Analytics • In-database processing – Applying a lead score or segmentation model in real time – Spark, Teradata, Vertica
  17. Why R? In High Demand Nationally
