
R at Microsoft

51,654 views

Presenter: David Smith
Presented to SURF (Sydney R User Group), June 25 2015

Published in: Technology

  1. • Introduction to R
     • Applications of R at Microsoft
     • R Products at Microsoft
     • What’s coming for R at Microsoft
     • Q&A
  2. April 6, 2015: “This acquisition will help customers use advanced analytics within Microsoft data platforms.”
  3. INTRODUCTION TO R
  4. • Most widely used data analysis software
     • Most powerful statistical programming language
     • Create beautiful and unique data visualizations
     • Thriving open-source community
     • Fills the talent gap
     www.revolutionanalytics.com/what-is-r
  5. • 1993: Research project in Auckland, NZ
     • 1995: Released as open-source software
     • 1997: R core group formed
     • 2000: R 1.0.0 released
     • 2003: R Foundation formed in Austria
     • 2004: First international user conference
     • 2007: Revolution Analytics founded
     • 2009: New York Times article on R
     • 2013: Revolution R Open released
     • 2015: Microsoft acquires Revolution Analytics
     Photo credit: Robert Gentleman
  6. • Rexer Data Miner Survey
     • IEEE Spectrum, July 2014
     Chart: R Usage Growth, Rexer Data Miner Survey, 2007-2013
     Chart: #9: R Language Popularity, IEEE Spectrum Top Programming Languages
     blog.revolutionanalytics.com/popularity
  7. New York Times, June 25, 2009 (3 hours after Michael Jackson’s death)
  8. R AT MICROSOFT
  9. Traditional BI and Advanced Analytics:
     • What happened?
     • Why did it happen?
     • What will happen?
     • How can we make it happen?
  10. • System monitoring & alerting
      • Capacity planning
  11. • TrueSkill Matchmaking System
      • Player churn
      • Game design
      • In-game purchase optimization
      • Fraud detection
      • Player communities
  12. MICROSOFT PRODUCTS WITH R
  13. • Enhanced Open Source R distribution
      • Compatible with all R-related software
      • Multi-threaded for performance
      • Focus on reproducibility
      • Open source (GPLv2 license)
      • Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE
      • Download from mran.revolutionanalytics.com
  14. • Built on latest R engine
      • 100% compatible with
      • Designed to work with RStudio
  15. • Multithreaded library replaces standard BLAS/LAPACK algorithms
      • High-performance algorithms
      • Sequential → Parallel
      • No need to change any R code
      • Included with RRO binary distributions
      More at Revolutions blog
  16. Adapted from http://xkcd.com/234/ (CC BY-NC 2.5)
  17. • Static CRAN mirror
      • Daily CRAN snapshots: mran.revolutionanalytics.com/snapshot
      • Easily write and share scripts synced to a specific snapshot
      Diagram labels: CRAN; CRAN mirror (http://cran.revolutionanalytics.com/); checkpoint server, daily snapshots at midnight UTC (http://mran.revolutionanalytics.com/snapshot/); checkpoint package: library(checkpoint), checkpoint("2014-09-17")
  18. • Easy to use: add 2 lines to the top of each script
      • For the package author:
      • For a script collaborator:
  19. • Download Revolution R Open
      • Learn about R and RRO
      • Daily CRAN snapshots
      • Explore Packages
      • Explore Task Views
  20. Trends
  21. R FOR BIG DATA
  22. • Toolkits for data scientists and numerical analysts to create custom parallel and distributed algorithms
      • Mainly useful for “embarrassingly parallel” problems, where parallel components work with small amounts of data
      • Big-data predictive analytics is mostly not embarrassingly parallel
      Details at projects.revolutionanalytics.com
  23. is… the only big data big analytics platform based on open source R, the de facto statistical computing language for modern analytics
  24. Data Step
      • Data import: Delimited, Fixed, SAS, SPSS, ODBC
      • Variable creation & transformation
      • Recode variables
      • Factor variables
      • Missing value handling
      • Sort, Merge, Split
      • Aggregate by category (means, sums)
      Descriptive Statistics
      • Min / Max, Mean, Median (approx.)
      • Quantiles (approx.)
      • Standard Deviation
      • Variance
      • Correlation
      • Covariance
      • Sum of Squares (cross product matrix for set variables)
      • Pairwise Cross tabs
      • Risk Ratio & Odds Ratio
      • Cross-Tabulation of Data (standard tables & long form)
      • Marginal Summaries of Cross Tabulations
      Statistical Tests
      • Chi Square Test
      • Kendall Rank Correlation
      • Fisher’s Exact Test
      • Student’s t-Test
      Sampling
      • Subsample (observations & variables)
      • Random Sampling
      Predictive Models
      • Sum of Squares (cross product matrix for set variables)
      • Multiple Linear Regression
      • Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
      • Covariance & Correlation Matrices
      • Logistic Regression
      • Classification & Regression Trees
      • Predictions/scoring for models
      • Residuals for all models
      Cluster Analysis
      • K-Means
      Classification
      • Decision Trees
      • Decision Forests
      • Gradient Boosted Decision Trees
      • Naïve Bayes
      Variable Selection
      • Stepwise Regression
      Simulation
      • Simulation (e.g. Monte Carlo)
      • Parallel Random Number Generation
      Combination
      • PEMA-R API
      • rxDataStep
      • rxExec
      Legend labels: New in v7.3; Coming in v7.4
  25. • ETL
      • Marketing channel data
      • Behavioral variables
      • Promotional data
      • Overlay data
      • Exploratory data analysis
      • Time-to-event models
      • GAM survival models
      • Scoring for inference
      • Scoring for prediction
      • 5 billion scores per day per retailer
      Diagram labels: CUSTOM DATA FORMAT; CUSTOM VARIABLES (PMML)
  26. R IN THE CLOUD
  27. • Exposing the expertise of data scientists as APIs
      • Bringing the utility of data science to applications
      • Addressing the Data Science talent gap
  28. Azure: Huge infrastructure scale
      19 Regions ONLINE… huge datacenter capacity around the world… and we’re growing
      • 100+ datacenters
      • One of the top 3 networks in the world (coverage, speed, connections)
      • 2x AWS and 6x Google in number of offered regions
      • G Series: largest VM available in the market (32 cores, 448 GB RAM, SSD…)
      Regions (map legend: Operational, Announced): Central US (Iowa), West US (California), North Europe (Ireland), East US (Virginia), East US 2 (Virginia), US Gov (Virginia), North Central US (Illinois), US Gov (Iowa), South Central US (Texas), Brazil South (Sao Paulo), West Europe (Netherlands), China North* (Beijing), China South* (Shanghai), Japan East (Saitama), Japan West (Osaka), India West (TBD), India East (TBD), East Asia (Hong Kong), SE Asia (Singapore), Australia West (Melbourne), Australia East (Sydney)
      * Operated by 21Vianet
  29. http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/
  30. WHAT’S COMING FOR R AT MICROSOFT
  31. [Slide contains no text]
  32. SQL Server 2016: Built-in in-database analytics
      • Data Scientist: Interact directly with data
      • Data Developer/DBA: Manage data and analytics together
      • Built-in to SQL Server
      Example Solutions
      • Fraud detection
      • Sales forecasting
      • Warehouse efficiency
      • Predictive maintenance
      Diagram labels: Relational Data; Analytic Library; T-SQL Interface; Extensibility; R Integration; New R scripts; Microsoft Azure Machine Learning Marketplace
  33. Chart (axes: rows, minutes) comparing:
      • R on a server pulling data via SQL
      • R on a server invoking RRE ScaleR inside the EDW
  34. Thank you
      Download Revolution R Open: mran.revolutionanalytics.com
      More at: blog.revolutionanalytics.com
      David Smith, R Community Lead, Revolution Analytics
      @revodavid
      davidsmi@microsoft.com
  35. More at deployr.revolutionanalytics.com
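
Slide 15 says the multithreaded math library needs no changes to R code. The sketch below is one way to read that claim: it is ordinary base-R linear algebra with nothing specific to any BLAS library, so any speedup would come from the BLAS/LAPACK build that R is linked against rather than from the script. The matrix size and variable names are illustrative assumptions, and no particular timing is claimed.

```r
# Ordinary base-R linear algebra; the same script runs unchanged whichever
# BLAS/LAPACK library R happens to be linked against.
set.seed(42)
n <- 500                               # illustrative size, not from the slides
x <- matrix(rnorm(n * n), nrow = n)

# crossprod(x) computes t(x) %*% x and is dispatched to the BLAS.
elapsed <- system.time(xtx <- crossprod(x))[["elapsed"]]
cat(sprintf("crossprod on a %d x %d matrix took %.3f s\n", n, n, elapsed))
```

Running the same script under a single-threaded and a multithreaded R build is a simple way to compare the two on a given machine.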

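
Slides 17 and 18 describe pinning a script to a dated CRAN snapshot. The sketch below illustrates only the repository-pinning idea in base R; it is not what the checkpoint package does internally. The helper `snapshot_url` is hypothetical, and the assumption that a snapshot is addressed by appending the date to the snapshot URL from slide 17 is mine, not something the slides state.

```r
# Hypothetical helper (not part of the checkpoint package): build the URL of a
# dated snapshot. Assumption: snapshots are addressed as <base><YYYY-MM-DD>.
snapshot_url <- function(date,
                         base = "http://mran.revolutionanalytics.com/snapshot/") {
  d <- as.Date(date)  # stops with an error if the date cannot be parsed
  paste0(base, format(d, "%Y-%m-%d"))
}

url <- snapshot_url("2014-09-17")

# Point this session's CRAN repository at the snapshot. Nothing is downloaded
# until a package is actually installed.
options(repos = c(CRAN = url))
print(getOption("repos"))
```

With the checkpoint package installed, the two lines shown on slide 17, `library(checkpoint)` and `checkpoint("2014-09-17")`, are the approach the slides themselves describe.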
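
Slide 22 refers to "embarrassingly parallel" problems, where each parallel component works on a small, independent piece of data. The sketch below illustrates that pattern with the `parallel` package that ships with R. The slides do not name the toolkits they have in mind, so the choice of `mclapply`, the chunking scheme, and the function names are all assumptions made for the example.

```r
library(parallel)  # ships with R; chosen for illustration only

# Split the work into independent chunks; no chunk needs data from another.
chunks <- split(1:1000, rep(1:4, each = 250))

# Each worker computes a partial result from its own chunk alone.
chunk_sum_sq <- function(v) sum(as.numeric(v)^2)

# mclapply forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
partial <- mclapply(chunks, chunk_sum_sq, mc.cores = cores)

# Combining the partial results is a cheap final step.
total <- sum(unlist(partial))
print(total)
```

The problem parallelizes trivially because the only communication is the final sum. Slide 22 notes that most big-data predictive analytics does not have this shape.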