SlideShare ist ein Scribd-Unternehmen logo
1 von 23
Downloaden Sie, um offline zu lesen
Enabling Exploratory Data
Science with Spark and R
Shivaram Venkataraman, Hossein Falaki (@mhfalaki)
About Apache Spark, AMPLab and Databricks
Apache Spark is a general distributed computingengine that unifies:
• Real-time streaming (SparkStreaming)
• Machine learning (SparkML/MLLib)
• SQL (SparkSQL)
• Graph processing (GraphX)
AMPLab (Algorithms, Machines, and Peoples lab) at UC Berkeley was where Spark and SparkR
were developed originally.
Databricks Inc. is the company founded by creators of Spark, focused on making big data
simple by offering an end to end data processing platform in the cloud
2
What is R?
Language and runtime
The cornerstone of R is the
data frame concept
3
Many data scientists love R
4
• Open source
• Highly dynamic
• Interactive environment
• Rich ecosystem of packages
• Powerful visualization infrastructure
• Data frames make data manipulation convenient
• Taughtby many schoolsto stats and computing students
Performance Limitations of R
R language
• R’s dynamic design imposes restrictions on optimization
R runtime
• Single threaded
• Everything has to fit in memory
5
What would be ideal?
Seamless manipulationand analysisof very large data in R
• R’s flexible syntax
• R’s rich package ecosystem
• R’s interactive environment
• Scalability (scaleup and out)
• Integration with distributed data sources/ storage
6
Augmenting R with other frameworks
In practice data scientists use R in conjunction with other frameworks
(Hadoop MR, Hive, Pig, Relational Databases, etc)
7
Framework	
  X
(Language	
  Y)
Distributed
Storage
1.	
  Load,	
  clean,	
  transform,	
  aggregate,	
  sample
Local
Storage
2.	
  Save	
  to	
  local	
  storage 3.	
  Read	
  and	
  analyze	
  in	
  R
Iterate
What is SparkR?
An R package distributed with ApacheSpark:
• Provides R frontend to Spark
• Exposes Spark Dataframes (inspired by R and Pandas)
• Convenientinteroperability between R and Spark DataFrames
8
+distributed/robust	
   processing,	
  data	
  
sources,	
  off-­‐memory	
  data	
  structures
Spark
Dynamic	
  environment,	
  interactivity,	
  
packages,	
  visualization
R
How does SparkR solve our problems?
No local storage involved
Write everything in R
Use Spark’s distributed cachefor interactive/iterative analysis at
speed of thought
9
Local
Storage
2.	
  Save	
  to	
  local	
  storage 3.	
  Read	
  and	
  analyze	
  in	
  R
Framework	
  X
(Language	
  Y)
Distributed
Storage
1.	
  Load,	
  clean,	
  transform,	
  aggregate,	
  sample
Iterate
Example SparkR program
# Loading distributed data
df <- read.df(“hdfs://bigdata/logs”, source = “json”)
# Distributed filtering and aggregation
errors <- subset(df, df$type == “error”)
counts <- agg(groupBy(errors, df$code), num = count(df$code))
# Collecting and plotting small data
qplot(code, num, data = collect(counts), geom = “bar”, stat = “identity”) + coord_flip()
10
SparkR architecture
11
Spark	
  Driver
R JVM
R	
  Backend
JVM
Worker
JVM
Worker
Data	
  Sources
Overview of SparkR API
IO
• read.df / write.df
• createDataFrame / collect
Caching
• cache / persist / unpersist
• cacheTable / uncacheTable
Utility functions
• dim / head / take
• names / rand / sample / ...
12
ML Lib
• glm / predict
DataFrame API
select / subset / groupBy
head / showDF /unionAll
agg / avg / column / ...
SQL
sql / table / saveAsTable
registerTempTable / tables
Moving data between R and JVM
13
R JVM
R	
  Backend
SparkR::collect()
SparkR::createDataFrame()
Moving data between R and JVM
14
R JVM
R	
  Backend
JVM
Worker
JVM
Worker
HDFS/S3/…
FUSE
read.df()
write.df()
Moving between languages
15
R Scala
Spark
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, “wiki”)
val wiki = table(“wiki”)
val parsed = wiki.map {
Row(_, _, text: String, _, _)
=>text.split(‘ ’)
}
val model = Kmeans.train(parsed)
Mixing R and SQL
Pass a query to SQLContextand getthe resultback as a DataFrame
16
# Register DataFrame as a table
registerTempTable(df, “dataTable”)
# Complex SQL query, result is returned as another DataFrame
aggCount <- sql(sqlContext, “select count(*) as num, type, date group by type order by date
desc”)
qplot(date, num, data = collect(aggCount), geom = “line”)
SparkR roadmap and upcoming features
• Exposing MLLib functionality in SparkR
• GLM already exposed with R formula support
• UDF supportin R
• Distribute a function and data
• Ideal way for distributing existing R functionality and packages
• Complete DataFrame API to behave/feeljust like data.frame
17
Example use case: exploratory analysis
• Data pipeline implemented in Scala/Python
• New files are appended to existing data partitioned by time
• Table schemeis saved in Hive metastore
• Data scientists use SparkRto analyze and visualize data
1. refreshTable(sqlConext,“logsTable”)
2. logs <- table(sqlContext,“logsTable”)
3. Iteratively analyze/aggregate/visualize using Spark & R DataFrames
4. Publish/share results
18
Demo
19
How to get started with SparkR?
• On your computer
1. Download latest version ofSpark (1.5.2)
2. Build (maven orsbt)
3. Run ./install-dev.sh inside the R directory
4. Start R shell by running ./bin/sparkR
• Deploy Spark (1.4+) on your cluster
• Sign up for 14 days free trial at Databricks
20
Summary
1. SparkR is an R frontend to ApacheSpark
2. Distributed data resides in the JVM
3. Workers are not runningR process(yet)
4. Distinction between Spark DataFrames and R data frames
21
Further pointers
http://spark.apache.org
http://www.r-project.org
http://www.ggplot2.org
https://cran.r-project.org/web/packages/magrittr
www.databricks.com
Office hour: 13-14 Databricks Booth
22
Thank you

Weitere ähnliche Inhalte

Was ist angesagt?

End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 

Was ist angesagt? (20)

Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
 
New Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastNew Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit East
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
 

Ähnlich wie Enabling exploratory data science with Spark and R

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 

Ähnlich wie Enabling exploratory data science with Spark and R (20)

Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r published
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
Spark core
Spark coreSpark core
Spark core
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 

Mehr von Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
chiefasafspells
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 

Kürzlich hochgeladen (20)

What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 

Enabling exploratory data science with Spark and R

  • 1. Enabling Exploratory Data Science with Spark and R Shivaram Venkataraman, Hossein Falaki (@mhfalaki)
  • 2. About Apache Spark, AMPLab and Databricks Apache Spark is a general distributed computingengine that unifies: • Real-time streaming (SparkStreaming) • Machine learning (SparkML/MLLib) • SQL (SparkSQL) • Graph processing (GraphX) AMPLab (Algorithms, Machines, and Peoples lab) at UC Berkeley was where Spark and SparkR were developed originally. Databricks Inc. is the company founded by creators of Spark, focused on making big data simple by offering an end to end data processing platform in the cloud 2
  • 3. What is R? Language and runtime The cornerstone of R is the data frame concept 3
  • 4. Many data scientists love R 4 • Open source • Highly dynamic • Interactive environment • Rich ecosystem of packages • Powerful visualization infrastructure • Data frames make data manipulation convenient • Taughtby many schoolsto stats and computing students
  • 5. Performance Limitations of R R language • R’s dynamic design imposes restrictions on optimization R runtime • Single threaded • Everything has to fit in memory 5
  • 6. What would be ideal? Seamless manipulationand analysisof very large data in R • R’s flexible syntax • R’s rich package ecosystem • R’s interactive environment • Scalability (scaleup and out) • Integration with distributed data sources/ storage 6
  • 7. Augmenting R with other frameworks In practice data scientists use R in conjunction with other frameworks (Hadoop MR, Hive, Pig, Relational Databases, etc) 7 Framework  X (Language  Y) Distributed Storage 1.  Load,  clean,  transform,  aggregate,  sample Local Storage 2.  Save  to  local  storage 3.  Read  and  analyze  in  R Iterate
  • 8. What is SparkR? An R package distributed with ApacheSpark: • Provides R frontend to Spark • Exposes Spark Dataframes (inspired by R and Pandas) • Convenientinteroperability between R and Spark DataFrames 8 +distributed/robust   processing,  data   sources,  off-­‐memory  data  structures Spark Dynamic  environment,  interactivity,   packages,  visualization R
  • 9. How does SparkR solve our problems? No local storage involved Write everything in R Use Spark’s distributed cachefor interactive/iterative analysis at speed of thought 9 Local Storage 2.  Save  to  local  storage 3.  Read  and  analyze  in  R Framework  X (Language  Y) Distributed Storage 1.  Load,  clean,  transform,  aggregate,  sample Iterate
  • 10. Example SparkR program # Loading distributed data df <- read.df(“hdfs://bigdata/logs”, source = “json”) # Distributed filtering and aggregation errors <- subset(df, df$type == “error”) counts <- agg(groupBy(errors, df$code), num = count(df$code)) # Collecting and plotting small data qplot(code, num, data = collect(counts), geom = “bar”, stat = “identity”) + coord_flip() 10
  • 11. SparkR architecture 11 Spark  Driver R JVM R  Backend JVM Worker JVM Worker Data  Sources
  • 12. Overview of SparkR API IO • read.df / write.df • createDataFrame / collect Caching • cache / persist / unpersist • cacheTable / uncacheTable Utility functions • dim / head / take • names / rand / sample / ... 12 ML Lib • glm / predict DataFrame API select / subset / groupBy head / showDF /unionAll agg / avg / column / ... SQL sql / table / saveAsTable registerTempTable / tables
  • 13. Moving data between R and JVM 13 R JVM R  Backend SparkR::collect() SparkR::createDataFrame()
  • 14. Moving data between R and JVM 14 R JVM R  Backend JVM Worker JVM Worker HDFS/S3/… FUSE read.df() write.df()
  • 15. Moving between languages 15 R Scala Spark df <- read.df(...) wiki <- filter(df, ...) registerTempTable(wiki, “wiki”) val wiki = table(“wiki”) val parsed = wiki.map { Row(_, _, text: String, _, _) =>text.split(‘ ’) } val model = Kmeans.train(parsed)
  • 16. Mixing R and SQL Pass a query to SQLContextand getthe resultback as a DataFrame 16 # Register DataFrame as a table registerTempTable(df, “dataTable”) # Complex SQL query, result is returned as another DataFrame aggCount <- sql(sqlContext, “select count(*) as num, type, date group by type order by date desc”) qplot(date, num, data = collect(aggCount), geom = “line”)
  • 17. SparkR roadmap and upcoming features • Exposing MLLib functionality in SparkR • GLM already exposed with R formula support • UDF supportin R • Distribute a function and data • Ideal way for distributing existing R functionality and packages • Complete DataFrame API to behave/feeljust like data.frame 17
  • 18. Example use case: exploratory analysis • Data pipeline implemented in Scala/Python • New files are appended to existing data partitioned by time • Table schemeis saved in Hive metastore • Data scientists use SparkRto analyze and visualize data 1. refreshTable(sqlConext,“logsTable”) 2. logs <- table(sqlContext,“logsTable”) 3. Iteratively analyze/aggregate/visualize using Spark & R DataFrames 4. Publish/share results 18
  • 20. How to get started with SparkR? • On your computer 1. Download latest version ofSpark (1.5.2) 2. Build (maven orsbt) 3. Run ./install-dev.sh inside the R directory 4. Start R shell by running ./bin/sparkR • Deploy Spark (1.4+) on your cluster • Sign up for 14 days free trial at Databricks 20
  • 21. Summary 1. SparkR is an R frontend to ApacheSpark 2. Distributed data resides in the JVM 3. Workers are not runningR process(yet) 4. Distinction between Spark DataFrames and R data frames 21