Enabling Exploratory Analysis of Large Data with Apache Spark and R

Enabling Exploratory Data Science
with Apache Spark and R
Hossein Falaki (@mhfalaki)

About the speaker: Hossein Falaki
Hossein Falaki is a software engineer at Databricks
working on the next big thing. Prior to that, he was
a data scientist at Apple’s personal assistant, Siri.
He graduated with a Ph.D. in Computer Science
from UCLA, where he was a member of the Center
for Embedded Networked Sensing (CENS).

About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with
Databricks; he is a hands-on data sciences engineer
with more than 15 years of experience developing
internet-scale infrastructure, data platforms, and
distributed systems for both on-premises and cloud.

We are Databricks, the company behind Apache Spark
Founded by the creators of
Apache Spark in 2013
Share of Spark code contributed by Databricks in 2014: 75%
Data Value
Created Databricks on top of Spark to make big data simple.

Why do we like R?
• Open source
• Highly dynamic
• Interactive environment
• Rich ecosystem of packages
• Powerful visualization infrastructure
• Data frames make data manipulation convenient
• Taught by many schools to stats and computing students

What would be ideal?
Seamless manipulation and analysis of very large data in R
• R’s flexible syntax
• R’s rich package ecosystem
• R’s interactive environment
• Scalability (scale up and out)
• Integration with distributed data sources/storage

Augmenting R with other frameworks
In practice data scientists use R in conjunction with other frameworks
(Hadoop MR, Hive, Pig, Relational Databases, etc)
[Diagram: workflow between Framework X (Language Y) on distributed storage, local storage, and R]
1. Load, clean, transform, aggregate, sample (in Framework X, Language Y, on distributed storage)
2. Save to local storage
3. Read and analyze in R
Iterate

What is SparkR?
An R package distributed with Apache Spark:
• Provides R frontend to Spark
• Exposes Spark DataFrames (inspired by R and Pandas)
• Convenient interoperability between R and Spark DataFrames
R: dynamic environment, interactivity, packages, visualization
+ Spark: distributed/robust processing, data sources, off-memory data structures

How does SparkR solve our problems?
No local storage involved
Write everything in R
Use Spark’s distributed cache for interactive/iterative analysis at
speed of thought
[Diagram: the three-step workflow from "Augmenting R with other frameworks" (Framework X on distributed storage, local storage, R), repeated]

Example SparkR program
library(SparkR)
library(ggplot2)  # provides qplot() and coord_flip()
# Loading distributed data
df <- read.df("hdfs://bigdata/logs", source = "json")
# Distributed filtering and aggregation
errors <- subset(df, df$type == "error")
counts <- agg(groupBy(errors, errors$code), num = count(errors$code))
# Collecting and plotting small data
qplot(code, num, data = collect(counts), geom = "bar", stat = "identity") + coord_flip()
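To see what this pipeline computes, the same filter-and-count can be sketched on a tiny local data.frame in base R. The records below are hypothetical stand-ins for the JSON logs, and nothing here touches Spark; it is only meant to make the shape of the result concrete.

```r
# Hypothetical log records standing in for the JSON data on HDFS.
logs <- data.frame(
  type = c("error", "info", "error", "error", "warn"),
  code = c(500L, 200L, 404L, 500L, 301L),
  stringsAsFactors = FALSE
)

# Filtering: keep only the error rows (the subset() step on the slide).
errors <- subset(logs, type == "error")

# Aggregation: count rows per code (the groupBy() + agg() step on the slide).
counts <- aggregate(
  list(num = errors$code),
  by = list(code = errors$code),
  FUN = length
)

# counts has one row per error code, with columns code and num.
print(counts)
```

In the SparkR version only the small aggregated result is collected into R; the filtering and counting run on the cluster.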

SparkR architecture
[Architecture diagram: Spark driver (R process and JVM, connected by the R backend), two JVM workers, data sources]

Overview of SparkR API
IO
• read.df / write.df
• createDataFrame / collect
Caching
• cache / persist / unpersist
• cacheTable / uncacheTable
Utility functions
• dim / head / take
• names / rand / sample / ...
MLlib
• glm / kmeans / ...
DataFrame API
• select / subset / groupBy
• head / showDF / unionAll
• agg / avg / column / ...
SQL
• sql / table / saveAsTable
• registerTempTable / tables

Overview of SparkR API :: SQLContext
SQLContext is your interface to Spark functionality in R
o SparkR DataFrames are implemented on top of Spark SQL tables
o All DataFrame operations go through a SQL optimizer (Catalyst)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
From now on, you don’t need the SparkContext (sc) any more.

Moving data between R and JVM
[Diagram: R and the driver JVM connected by the R backend; JVM workers read from and write to HDFS/S3/… via read.df() and write.df()]

Moving data between R and JVM
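[Diagram: R and the JVM connected by the R backend]
SparkR::collect() brings a Spark DataFrame into R as a local data frame
SparkR::createDataFrame() turns a local R data frame into a Spark DataFrame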
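A minimal sketch of that round trip, assuming the SparkR 1.x-style API used elsewhere in these slides (a `sqlContext` passed as the first argument). `round_trip()` is a hypothetical helper name, not part of SparkR, and it can only be called with SparkR installed and a context initialized.

```r
# round_trip() is a hypothetical helper, not a SparkR function.
# SparkR calls are namespaced, so defining it does not require Spark.
round_trip <- function(sqlContext, local_df) {
  # R -> JVM: turn the local data.frame into a distributed Spark DataFrame.
  spark_df <- SparkR::createDataFrame(sqlContext, local_df)
  # JVM -> R: pull every row back into a local data.frame.
  SparkR::collect(spark_df)
}

# Usage, assuming sc and sqlContext were created with
# sparkR.init() and sparkRSQL.init(sc):
# back <- round_trip(sqlContext, faithful)
# head(back)
```

Because collect() copies all rows into the R process, it is best reserved for data that is already small, such as an aggregated result.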

Overview of SparkR API :: Caching
Controls caching of distributed data:
o persist(sparkDF, storageLevel), where storageLevel is one of:
  o DISK_ONLY
  o MEMORY_AND_DISK
  o MEMORY_AND_DISK_SER
  o MEMORY_ONLY
  o MEMORY_ONLY_SER
  o OFF_HEAP
o cache(sparkDF) == persist(sparkDF, "MEMORY_ONLY")
o cacheTable(sqlContext, "table_name")
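One way these calls might fit together during iterative analysis is sketched below: persist a Spark DataFrame, run some work against it, and release the cache afterwards. `with_cache()` and its `work` argument are hypothetical names introduced for illustration; only persist(), count() and unpersist() come from SparkR.

```r
# with_cache() is a hypothetical helper, not a SparkR function.
# SparkR calls are namespaced, so defining it does not require Spark.
with_cache <- function(spark_df, work, level = "MEMORY_AND_DISK") {
  # Mark the DataFrame for caching at the requested storage level.
  cached <- SparkR::persist(spark_df, level)
  # Release the cached data when this function exits, even on error.
  on.exit(SparkR::unpersist(cached))
  # Caching is lazy; running an action such as count() materializes it.
  SparkR::count(cached)
  # Run the caller's analysis against the cached DataFrame.
  work(cached)
}

# Usage, assuming logs is an existing Spark DataFrame with a type column:
# n_errors <- with_cache(logs, function(d) {
#   SparkR::count(SparkR::filter(d, d$type == "error"))
# })
```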

Overview of SparkR API :: DataFrame API
SparkR DataFrames behave similarly to R data.frames
o sparkDF$newCol <- sparkDF$col + 1
o subsetDF <- sparkDF[, c("date", "type")]
o recentData <- subset(sparkDF, sparkDF$date == "2015-10-24")
o firstRow <- head(sparkDF, 1)
o names(subsetDF) <- c("Date", "Type")
o dim(recentData)
o head(collect(count(group_by(subsetDF, "Date"))))
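For comparison, here are the local base-R idioms that these SparkR operations mirror, run on a small hypothetical data.frame. The column names follow the slide; the values are made up for illustration. The syntax is nearly the same, but everything below executes in the local R process rather than on the cluster.

```r
# Hypothetical local data standing in for sparkDF.
localDF <- data.frame(
  date = c("2015-10-23", "2015-10-24", "2015-10-24"),
  type = c("info", "error", "info"),
  col  = c(1, 2, 3),
  stringsAsFactors = FALSE
)

localDF$newCol <- localDF$col + 1                    # derived column
subsetDF <- localDF[, c("date", "type")]             # column selection
recentData <- subset(localDF, date == "2015-10-24")  # row filter
firstRow <- head(localDF, 1)                         # first row
names(subsetDF) <- c("Date", "Type")                 # rename columns
dim(recentData)                                      # number of rows and columns
table(subsetDF$Date)                                 # row count per Date
```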

Overview of SparkR API :: SQL
You can register a DataFrame as a table and query it in SQL
o logs <- read.df(sqlContext, "data/logs", source = "json")
o registerTempTable(logs, "logsTable")
o errorsByCode <- sql(sqlContext, "select count(*) as num, code from logsTable where type = 'error' group by code order by num desc")
o reviewsDF <- table(sqlContext, "reviewsTable")
o registerTempTable(filter(reviewsDF, reviewsDF$rating == 5), "fiveStars")

Mixing R and SQL
Pass a query to SQLContext and get the result back as a DataFrame
# Register DataFrame as a table
registerTempTable(df, "dataTable")
# Complex SQL query, result is returned as another DataFrame
aggCount <- sql(sqlContext, "select count(*) as num, type, date from dataTable group by type, date order by date desc")
qplot(date, num, data = collect(aggCount), geom = "line")

Moving between languages
[Diagram: R and Scala sharing the same Spark instance]
R:
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, "wiki")
Scala:
val wiki = table("wiki")
val parsed = wiki.map {
  case Row(_, _, text: String, _, _) => text.split(' ')
}
val model = KMeans.train(parsed)

How to get started with SparkR?
• On your computer
1. Download latest version of Spark (2.0)
2. Build (maven or sbt)
3. Run ./install-dev.sh inside the R directory
4. Start R shell by running ./bin/sparkR
• Deploy Spark on your cluster
• Sign up for Databricks Community Edition:
https://databricks.com/try-databricks

Summary
1. SparkR is an R frontend to Apache Spark
2. Distributed data resides in the JVM
3. Workers are not running R processes (yet)
4. Distinction between Spark DataFrames and R data frames
