11. 3. Large Scale Machine Learning
Data → Featurize → Learning → Model
training <- read.csv("t.csv")
model <- glm(
  delay ~ Distance + Dest,
  family = "gaussian",
  data = training)
summary(model)
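As a rough illustration of the Featurize step, R's formula interface expands a categorical column such as Dest into indicator columns before fitting. The sketch below uses base R only and a small made-up data frame; the `flights` name and all of its values are assumptions for illustration, not the contents of t.csv.

```r
# Hypothetical toy data standing in for t.csv; all values are invented.
flights <- data.frame(
  delay    = c(10, 25, 5, 40, 15, 30),
  Distance = c(300, 1200, 250, 2400, 800, 1500),
  Dest     = factor(c("SFO", "JFK", "SFO", "SEA", "JFK", "SEA"))
)

# model.matrix() shows the numeric features the formula produces:
# an intercept, Distance, and one indicator column per non-baseline Dest level.
features <- model.matrix(delay ~ Distance + Dest, data = flights)
print(colnames(features))

# glm() performs the same expansion internally before fitting.
model <- glm(delay ~ Distance + Dest, family = "gaussian", data = flights)
print(coef(model))
```

The coefficient names line up with the feature columns, which is one way to check how a formula was expanded.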
13. Big Data & R
Big Data, Small Learning
Partition, Aggregate
Large Scale Machine Learning
SparkR: Unified approach
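The partition and aggregate pattern can be sketched in base R: split the data into chunks, compute a small summary on each chunk, then combine the summaries. This is only an illustration on an in-memory vector with invented data and hypothetical names (`delays`, `partitions`, `partials`); it does not use SparkR or a cluster.

```r
# Invented data; in a real deployment each chunk could live on a different machine.
set.seed(1)
delays <- rnorm(1000, mean = 12, sd = 5)

# Partition: assign each value to one of 4 chunks.
partitions <- split(delays, rep(1:4, length.out = length(delays)))

# Per-partition work: each chunk yields only a sum and a count.
partials <- lapply(partitions, function(x) c(sum = sum(x), n = length(x)))

# Aggregate: add up the per-chunk summaries and derive the global mean.
totals <- Reduce(`+`, partials)
global_mean <- unname(totals["sum"] / totals["n"])

# The combined result agrees with the single-machine computation.
print(all.equal(global_mean, mean(delays)))
```

Only the small per-chunk summaries need to be brought together, which is what makes the pattern attractive when the full data set does not fit on one machine.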
14. SparkR DataFrames
people <- read.df("people.json", "json")
avgAge <- select(people, avg(people$age))
head(avgAge)
Support for a number of data sources
Column Functions, SQL
Support for R UDFs
15. Large Scale Machine Learning
Integration with MLlib
Key Features
R-like formulas
Model statistics
model <- glm(
a ~ b + c,
data = df)
summary(model)
17. SparkR Status
Open source -- Part of Apache Spark
> 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx, etc.
Contributions welcome!
18. Tutorial Outline
Part 1: Data Exploration
• ETL: Data loading, schema
• Exploration: Filter, clean, aggregate etc.
• Visualization: Integration with ggplot
Part 2: Advanced Analytics (After the break)
19. Tutorial Setup
Each user gets a dedicated micro cluster
• Cluster is terminated after 1 hour of inactivity
• Multiple users can collaborate on a notebook
Notebooks can be exported/imported
Examples and tutorials in R/Python/Scala
Free online service for learning Apache Spark