Data Science in Spark with sparklyr Cheat Sheet

Intro
sparklyr is an R interface for Apache Spark™. It provides a complete dplyr back end and the option to query directly using Spark SQL statements. With sparklyr, you can orchestrate distributed machine learning using either Spark's MLlib or H2O Sparkling Water.
www.rstudio.com
Data Science Toolchain with Spark + sparklyr
• Import: export an R DataFrame, read a file, or read an existing Hive table
• Tidy: dplyr verbs, direct Spark SQL (DBI), or SDF functions (Scala API)
• Transform: transformer functions
• Model: Spark MLlib or the H2O extension
• Visualize: collect data into R for plotting
• Communicate: collect data into R; share plots, documents, and apps
Workflow adapted from R for Data Science, Grolemund & Wickham
Getting started

On a YARN Managed Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes, preferably an edge node.
2. Locate the path to the cluster's Spark home directory; it is normally "/usr/lib/spark".
3. Open a connection:
spark_connect(master = "yarn-client", version = "1.6.2", spark_home = [Cluster's Spark path])

On a Mesos Managed Cluster
1. Install RStudio Server or Pro on one of the existing nodes.
2. Locate the path to the cluster's Spark directory.
3. Open a connection:
spark_connect(master = "[mesos URL]", version = "1.6.2", spark_home = [Cluster's Spark path])
RStudio Integrates with sparklyr
Starting with version 1.044, RStudio Desktop, Server, and Pro include integrated support for the sparklyr package. You can create and manage connections to Spark clusters and local Spark instances from inside the IDE. From the Connections pane you can open the connection log, open the Spark UI, preview the first 1K rows of Spark and Hive tables, and disconnect.
Using sparklyr
A brief example of a data analysis using Apache Spark, R, and sparklyr in local mode:

library(sparklyr)
library(dplyr)
library(ggplot2)
library(tidyr)
set.seed(100)

spark_install("2.0.1")                 # Install Spark locally
sc <- spark_connect(master = "local")  # Connect to local version

import_iris <- copy_to(sc, iris, "spark_iris",
  overwrite = TRUE)                    # Copy data to Spark memory

partition_iris <- sdf_partition(       # Partition data
  import_iris, training = 0.5, testing = 0.5)

sdf_register(partition_iris,           # Create Hive metadata for each partition
  c("spark_iris_training", "spark_iris_test"))

tidy_iris <- tbl(sc, "spark_iris_training") %>%   # Create reference to Spark table
  select(Species, Petal_Length, Petal_Width)

model_iris <- tidy_iris %>%            # Spark ML decision tree model
  ml_decision_tree(response = "Species",
    features = c("Petal_Length", "Petal_Width"))

test_iris <- tbl(sc, "spark_iris_test")

pred_iris <- sdf_predict(model_iris, test_iris) %>%
  collect()                            # Bring data back into R memory for plotting

pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
    lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()

spark_disconnect(sc)
Tuning Spark

Example Configuration
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client", config = config, version = "2.0.1")

Important Tuning Parameters, with defaults where they apply
• spark.yarn.am.cores 1
• spark.yarn.am.memory 512m
• spark.network.timeout 120s
• spark.executor.heartbeatInterval 10s
• spark.executor.memory 1g
• spark.executor.cores
• spark.executor.extraJavaOptions
• spark.executor.instances
• sparklyr.shell.executor-memory
• sparklyr.shell.driver-memory
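The sparklyr.shell.* options set flags on the spark-submit command line, so they must be in the config before the connection is opened. A minimal sketch, assuming a reachable YARN cluster (the memory values and instance count here are illustrative, not defaults):

```r
# Sketch: raise driver/executor memory via sparklyr.shell options and
# request four executors; these take effect only at connect time.
library(sparklyr)

config <- spark_config()
config$`sparklyr.shell.driver-memory`   <- "8G"
config$`sparklyr.shell.executor-memory` <- "8G"
config$spark.executor.instances         <- 4

sc <- spark_connect(master = "yarn-client", config = config,
                    version = "2.0.1")
```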
Cluster Deployment Options
• Managed cluster: the driver node connects to the worker nodes through a cluster manager, either YARN or Mesos.
• Stand-alone cluster: the driver node connects to the worker nodes through Spark's own cluster manager.
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com Learn more at spark.rstudio.com • package version 0.5 • Updated: 12/21/16
On a Spark Standalone Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes or on a server in the same LAN.
2. Install a local version of Spark: spark_install(version = "2.0.1")
3. Open a connection:
spark_connect(master = "spark://host:port", version = "2.0.1", spark_home = spark_home_dir())

Using Livy (Experimental)
1. The Livy REST application should be running on the cluster.
2. Connect to the cluster:
sc <- spark_connect(master = "http://host:port", method = "livy")
Local Mode
Easy setup; no cluster required.
1. Install a local version of Spark: spark_install("2.0.1")
2. Open a connection: sc <- spark_connect(master = "local")

Disconnect: spark_disconnect(sc)
Visualize & Communicate

Download data to R memory:
dplyr::collect(x)
  r_table <- collect(my_table)
  plot(Petal_Width ~ Petal_Length, data = r_table)
sdf_read_column(x, column)
  Returns the contents of a single column to R

Import

Copy a DataFrame into Spark:
sdf_copy_to(sc, x, name, memory, repartition, overwrite)
  sdf_copy_to(sc, iris, "spark_iris")
Via Spark SQL commands:
DBI::dbWriteTable(conn, name, value)
  DBI::dbWriteTable(sc, "spark_iris", iris)
From a table in Hive:
tbl_cache(sc, name, force = TRUE)
  Loads the table into memory
  my_var <- tbl_cache(sc, name = "hive_iris")
dplyr::tbl(sc, ...)
  Creates a reference to the table without loading it into memory
  my_var <- dplyr::tbl(sc, name = "hive_iris")
Model (MLlib)
ml_als_factorization(x, rating.column = "rating", user.column =
"user", item.column = "item", rank = 10L, regularization.parameter =
0.1, iter.max = 10L, ml.options = ml_options())
ml_decision_tree(x, response, features, max.bins = 32L, max.depth
= 5L, type = c("auto", "regression", "classification"), ml.options =
ml_options())
ml_generalized_linear_regression(x, response, features,
intercept = TRUE, family = gaussian(link = "identity"), iter.max =
100L, ml.options = ml_options())
Same options for: ml_gradient_boosted_trees
ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x),
compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha =
(50/k) + 1, beta = 0.1 + 1, ml.options = ml_options())
ml_linear_regression(x, response, features, intercept = TRUE,
alpha = 0, lambda = 0, iter.max = 100L, ml.options = ml_options())
Same options for: ml_logistic_regression
ml_multilayer_perceptron(x, response, features, layers, iter.max
= 100, seed = sample(.Machine$integer.max, 1), ml.options =
ml_options())
ml_naive_bayes(x, response, features, lambda = 0, ml.options =
ml_options())
ml_one_vs_rest(x, classifier, response, features, ml.options =
ml_options())
ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
ml_random_forest(x, response, features, max.bins = 32L,
max.depth = 5L, num.trees = 20L, type = c("auto", "regression",
"classification"), ml.options = ml_options())
ml_survival_regression(x, response, features, intercept = TRUE,
censor = "censor", iter.max = 100L, ml.options = ml_options())
ml_binary_classification_eval(predicted_tbl_spark, label,
score, metric = "areaUnderROC")
ml_classification_eval(predicted_tbl_spark, label, predicted_lbl,
metric = "f1")
ml_tree_feature_importance(sc, model)
ml_decision_tree(my_table, response = "Species",
features = c("Petal_Length", "Petal_Width"))
Wrangle
Spark SQL via dplyr verbs
Translates into Spark SQL statements:
my_table <- my_var %>%
  filter(Species == "setosa") %>%
  sample_n(10)

Direct Spark SQL commands
DBI::dbGetQuery(conn, statement)
  my_table <- DBI::dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
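To inspect the SQL that a dplyr pipeline will send to Spark without executing it, dplyr::show_query() can be called on the lazy table. A sketch, assuming the connection and the my_var table reference from above:

```r
# Sketch: dplyr verbs build a lazy query; show_query() prints the
# generated Spark SQL instead of running it on the cluster.
library(dplyr)

my_var %>%
  filter(Species == "setosa") %>%
  show_query()
```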
Scala API via SDF functions
sdf_mutate(.data)
Works like the dplyr mutate() function
sdf_partition(x, ..., weights = NULL, seed
= sample (.Machine$integer.max, 1))
sdf_partition(x, training = 0.5, test = 0.5)
sdf_register(x, name = NULL)
Gives a Spark DataFrame a table name
sdf_sample(x, fraction = 1, replacement =
TRUE, seed = NULL)
sdf_sort(x, columns)
Sorts by >=1 columns in ascending order
sdf_with_unique_id(x, id = "id")
Add unique ID column
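These SDF helpers compose with pipes. A sketch, assuming a Spark DataFrame such as the import_iris table created in the Using sparklyr example:

```r
# Sketch: tag each row with a unique ID, draw a 10% sample without
# replacement, then sort by Species; all steps run inside Spark.
library(sparklyr)
library(dplyr)

sampled <- import_iris %>%
  sdf_with_unique_id(id = "row_id") %>%
  sdf_sample(fraction = 0.1, replacement = FALSE, seed = 100) %>%
  sdf_sort(c("Species"))
```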
ML Transformers
ft_binarizer(threshold = 0.5)
  Assigned values based on threshold
ft_bucketizer(splits)
  Numeric column to discretized column
ft_discrete_cosine_transform(inverse = FALSE)
  Time domain to frequency domain
ft_elementwise_product(scaling.col)
  Element-wise product between 2 columns
ft_index_to_string()
  Index labels back to labels as strings
ft_one_hot_encoder()
  Continuous to binary vectors
ft_quantile_discretizer(n.buckets = 5L)
  Continuous to binned categorical values
ft_sql_transformer(sql)
ft_string_indexer(params = NULL)
  Column of labels into a column of label indices
ft_vector_assembler()
  Combine vectors into a single row-vector

Arguments that apply to all functions:
x, input.col = NULL, output.col = NULL

Example:
ft_binarizer(my_table, input.col = "Petal_Length",
  output.col = "petal_large", threshold = 1.2)
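Transformers can also bin a column against explicit cut points. A sketch using ft_bucketizer, assuming my_table has a numeric Petal_Length column (the split points and output column name are illustrative):

```r
# Sketch: bin Petal_Length into three buckets; the splits vector must
# cover the full range of the column, hence the -Inf/Inf guards.
library(sparklyr)

binned <- ft_bucketizer(my_table,
  input.col  = "Petal_Length",
  output.col = "petal_bin",
  splits     = c(-Inf, 2, 5, Inf))
```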
Reading & Writing from Apache Spark
• File system into Spark: spark_read_<fmt>
• Spark to the file system: spark_write_<fmt>
• R into Spark: sdf_copy_to, dplyr::copy_to, DBI::dbWriteTable
• Spark into R (download a Spark DataFrame to an R DataFrame): dplyr::collect, sdf_read_column
• Reference or cache a Spark table from R: dplyr::tbl, tbl_cache
Extensions
Create an R package that calls the full Spark API and provides interfaces to Spark packages.

Core Types
spark_connection()
  Connection between R and the Spark shell process
spark_jobj()
  Instance of a remote Spark object
spark_dataframe()
  Instance of a remote Spark DataFrame object

Call Spark from R
invoke()
  Call a method on a Java object
invoke_new()
  Create a new object by invoking a constructor
invoke_static()
  Call a static method on an object
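A minimal sketch of the invoke API, assuming an open connection sc:

```r
# Sketch: instantiate a JVM object through the Spark shell process
# and call a method on it from R.
library(sparklyr)

billion <- invoke_new(sc, "java.math.BigInteger", "1000000000")
invoke(billion, "longValue")  # returns the value to R as a numeric
```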
Save from Spark to the File System
Arguments that apply to all functions: x, path
CSV:     spark_write_csv(header = TRUE, delimiter = ",",
         quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON:    spark_write_json(mode = NULL)
PARQUET: spark_write_parquet(mode = NULL)

Import into Spark from a File
Arguments that apply to all functions:
sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE
CSV:     spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE,
         delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON:    spark_read_json()
PARQUET: spark_read_parquet()
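A sketch of the read/write round trip, assuming an open connection sc and a local CSV file (the paths and table name are illustrative):

```r
# Sketch: import a CSV into Spark memory as a named table, then
# write it back out in the columnar Parquet format.
library(sparklyr)

flights <- spark_read_csv(sc, name = "flights",
  path = "data/flights.csv",
  header = TRUE, infer_schema = TRUE)

spark_write_parquet(flights, path = "data/flights_parquet")
```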
sdf_predict(object, newdata)
Spark DataFrame with predicted values
Machine Learning Extensions
ml_create_dummy_variables()
ml_model()
ml_prepare_dataframe()
ml_prepare_response_features_intercept()
ml_options()
Weitere ähnliche Inhalte

Was ist angesagt?

Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformationTim Essam
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat SheetLaura Hughes
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RRsquared Academy
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In RRsquared Academy
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In RRsquared Academy
 
Hive function-cheat-sheet
Hive function-cheat-sheetHive function-cheat-sheet
Hive function-cheat-sheetDr. Volkan OBAN
 
Stata cheat sheet analysis
Stata cheat sheet analysisStata cheat sheet analysis
Stata cheat sheet analysisTim Essam
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data StructureSakthi Dasans
 
Introduction to R programming
Introduction to R programmingIntroduction to R programming
Introduction to R programmingAlberto Labarga
 
Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?Lucas Witold Adamus
 
R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RRsquared Academy
 
R Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RR Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RRsquared Academy
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
 
Data Analysis and Programming in R
Data Analysis and Programming in RData Analysis and Programming in R
Data Analysis and Programming in REshwar Sai
 
Data handling in r
Data handling in rData handling in r
Data handling in rAbhik Seal
 

Was ist angesagt? (20)

Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformation
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In R
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
 
Hive function-cheat-sheet
Hive function-cheat-sheetHive function-cheat-sheet
Hive function-cheat-sheet
 
R learning by examples
R learning by examplesR learning by examples
R learning by examples
 
Language R
Language RLanguage R
Language R
 
R programming language
R programming languageR programming language
R programming language
 
Stata cheat sheet analysis
Stata cheat sheet analysisStata cheat sheet analysis
Stata cheat sheet analysis
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data Structure
 
Introduction to R programming
Introduction to R programmingIntroduction to R programming
Introduction to R programming
 
Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?Why async and functional programming in PHP7 suck and how to get overr it?
Why async and functional programming in PHP7 suck and how to get overr it?
 
R Programming: Mathematical Functions In R
R Programming: Mathematical Functions In RR Programming: Mathematical Functions In R
R Programming: Mathematical Functions In R
 
R Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RR Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In R
 
Data Management in Python
Data Management in PythonData Management in Python
Data Management in Python
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Data Analysis and Programming in R
Data Analysis and Programming in RData Analysis and Programming in R
Data Analysis and Programming in R
 
Data handling in r
Data handling in rData handling in r
Data handling in r
 

Ähnlich wie Sparklyr

Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Machine learning using spark
Machine learning using sparkMachine learning using spark
Machine learning using sparkRan Silberman
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkRde:code 2017
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingDatabricks
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Sparkfelixcss
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra EnvironmentJim Hatcher
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 

Ähnlich wie Sparklyr (20)

Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Machine learning using spark
Machine learning using sparkMachine learning using spark
Machine learning using spark
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
[AI04] Scaling Machine Learning to Big Data Using SparkML and SparkR
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 

Mehr von Dieudonne Nahigombeye (8)

Rstudio ide-cheatsheet
Rstudio ide-cheatsheetRstudio ide-cheatsheet
Rstudio ide-cheatsheet
 
Reg ex cheatsheet
Reg ex cheatsheetReg ex cheatsheet
Reg ex cheatsheet
 
How big-is-your-graph
How big-is-your-graphHow big-is-your-graph
How big-is-your-graph
 
Ggplot2 cheatsheet-2.1
Ggplot2 cheatsheet-2.1Ggplot2 cheatsheet-2.1
Ggplot2 cheatsheet-2.1
 
Eurostat cheatsheet
Eurostat cheatsheetEurostat cheatsheet
Eurostat cheatsheet
 
Devtools cheatsheet
Devtools cheatsheetDevtools cheatsheet
Devtools cheatsheet
 
Base r
Base rBase r
Base r
 
Advanced r
Advanced rAdvanced r
Advanced r
 

Kürzlich hochgeladen

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 

Kürzlich hochgeladen (20)

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Determinants of health, dimensions of health, positive health and spectrum of...

sparklyr

Data Science in Spark with sparklyr: Cheat Sheet

Intro
sparklyr is an R interface for Apache Spark™. It provides a complete dplyr backend and the option to query directly using Spark SQL statements. With sparklyr, you can orchestrate distributed machine learning using either Spark’s MLlib or H2O Sparkling Water. (www.rstudio.com)

Data Science Toolchain with Spark + sparklyr
• Import: export an R DataFrame, read a file, or read an existing Hive table
• Tidy: dplyr verbs, direct Spark SQL (DBI), or SDF functions (Scala API)
• Transform: transformer functions
• Model: Spark MLlib or the H2O extension
• Visualize: collect data into R for plotting
• Communicate: collect data into R; share plots, documents, and apps
(R for Data Science, Grolemund & Wickham)

Getting started

On a YARN-managed cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes, preferably an edge node.
2. Locate the path to the cluster's Spark home directory; it is normally "/usr/lib/spark".
3. Open a connection:
   spark_connect(master = "yarn-client", version = "1.6.2", spark_home = [Cluster's Spark path])

On a Mesos-managed cluster
1. Install RStudio Server or Pro on one of the existing nodes.
2. Locate the path to the cluster's Spark directory.
3. Open a connection:
   spark_connect(master = "[mesos URL]", version = "1.6.2", spark_home = [Cluster's Spark path])

RStudio integrates with sparklyr
Starting with version 1.044, RStudio Desktop, Server, and Pro include integrated support for the sparklyr package. You can create and manage connections to Spark clusters and local Spark instances from inside the IDE: open the Spark UI, browse Spark and Hive tables, open the connection log, preview 1K rows, and disconnect.
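The connection pattern above is the same across cluster managers; only the master and spark_home arguments change. A minimal sketch of the connect/disconnect lifecycle for a YARN-client session, assuming "/usr/lib/spark" (the common default noted above) is your cluster's actual Spark home:

```r
library(sparklyr)

# Assumed path; substitute your cluster's real Spark home directory
sc <- spark_connect(master = "yarn-client",
                    version = "1.6.2",
                    spark_home = "/usr/lib/spark")

# ... interact with the cluster via dplyr, SDF functions, or MLlib ...

spark_disconnect(sc)  # close the connection when finished
```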
Using sparklyr
A brief example of a data analysis using Apache Spark, R, and sparklyr in local mode:

library(sparklyr); library(dplyr); library(ggplot2); library(tidyr)
set.seed(100)

spark_install("2.0.1")                               # Install Spark locally
sc <- spark_connect(master = "local")                # Connect to local version

import_iris <- copy_to(sc, iris, "spark_iris",
                       overwrite = TRUE)             # Copy data to Spark memory

partition_iris <- sdf_partition(import_iris,
  training = 0.5, testing = 0.5)                     # Partition data
sdf_register(partition_iris,
  c("spark_iris_training", "spark_iris_test"))       # Create Hive metadata for each partition

tidy_iris <- tbl(sc, "spark_iris_training") %>%      # Create reference to Spark table
  select(Species, Petal_Length, Petal_Width)

model_iris <- tidy_iris %>%                          # Spark ML decision tree model
  ml_decision_tree(response = "Species",
    features = c("Petal_Length", "Petal_Width"))

test_iris <- tbl(sc, "spark_iris_test")
pred_iris <- sdf_predict(model_iris, test_iris) %>%
  collect()                                          # Bring data back into R memory for plotting

pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
    lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()

spark_disconnect(sc)

Cluster Deployment: Example Configuration
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client",
                    config = config, version = "2.0.1")

Important Tuning Parameters (defaults in parentheses where given)
• spark.executor.heartbeatInterval (10s)
• spark.network.timeout (120s)
• spark.executor.memory (1g)
• spark.executor.cores (1)
• spark.executor.extraJavaOptions
• spark.executor.instances
• sparklyr.shell.executor-memory
• sparklyr.shell.driver-memory
• spark.yarn.am.cores
• spark.yarn.am.memory (512m)

Cluster Deployment Options
• Managed cluster: a driver node plus worker nodes, managed by YARN or Mesos.
• Stand-alone cluster: a driver node, a cluster manager, and worker nodes.

RStudio® is a trademark of RStudio, Inc.
CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at spark.rstudio.com • package version 0.5 • Updated: 12/21/16

On a Spark stand-alone cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes, or on a server in the same LAN.
2. Install a local version of Spark: spark_install(version = "2.0.1")
3. Open a connection:
   spark_connect(master = "spark://host:port", version = "2.0.1", spark_home = spark_home_dir())

Using Livy (experimental)
1. The Livy REST application should be running on the cluster.
2. Connect to the cluster:
   sc <- spark_connect(master = "http://host:port", method = "livy")

Tuning Spark: Local Mode
Easy setup; no cluster required.
1. Install a local version of Spark: spark_install("2.0.1")
2. Open a connection: sc <- spark_connect(master = "local")
Import
Copy a DataFrame into Spark:
  sdf_copy_to(sc, x, name, memory, repartition, overwrite)
  sdf_copy_to(sc, iris, "spark_iris")
Spark SQL commands:
  DBI::dbWriteTable(conn, name, value)
  DBI::dbWriteTable(sc, "spark_iris", iris)
From a table in Hive:
  tbl_cache(sc, name, force = TRUE) — loads the table into memory
  my_var <- tbl_cache(sc, name = "hive_iris")
  dplyr::tbl(sc, ...) — creates a reference to the table without loading it into memory
  my_var <- dplyr::tbl(sc, name = "hive_iris")

Wrangle
Spark SQL via dplyr verbs (translates into Spark SQL statements):
  my_table <- my_var %>% filter(Species == "setosa") %>% sample_n(10)
Direct Spark SQL commands:
  DBI::dbGetQuery(conn, statement)
  my_table <- DBI::dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
Scala API via SDF functions:
  sdf_mutate(.data) — works like the dplyr mutate function
  sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))
    e.g. sdf_partition(x, training = 0.5, test = 0.5)
  sdf_register(x, name = NULL) — gives a Spark DataFrame a table name
  sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
  sdf_sort(x, columns) — sorts by one or more columns in ascending order
  sdf_with_unique_id(x, id = "id") — adds a unique ID column

ML Transformers
Arguments that apply to all functions: x, input.col = NULL, output.col = NULL
  ft_binarizer(threshold = 0.5) — assigns values based on a threshold
    e.g. ft_binarizer(my_table, input.col = "Petal_Length", output.col = "petal_large", threshold = 1.2)
  ft_bucketizer(splits) — continuous to binned categorical values
  ft_discrete_cosine_transform(inverse = FALSE) — time domain to frequency domain
  ft_elementwise_product(scaling.col) — element-wise product between two columns
  ft_index_to_string() — index labels back to labels as strings
  ft_one_hot_encoder() — continuous to binary vectors
  ft_quantile_discretizer(n.buckets = 5L) — numeric column to discretized column
  ft_sql_transformer(sql)
  ft_string_indexer(params = NULL) — column of labels into a column of label indices
  ft_vector_assembler() — combines vectors into a single row-vector

Visualize & Communicate
Download data to R memory:
  dplyr::collect(x)
  r_table <- collect(my_table)
  plot(Petal_Width ~ Petal_Length, data = r_table)
  sdf_read_column(x, column) — returns the contents of a single column to R

Model (MLlib)
  ml_als_factorization(x, rating.column = "rating", user.column = "user", item.column = "item", rank = 10L, regularization.parameter = 0.1, iter.max = 10L, ml.options = ml_options())
  ml_decision_tree(x, response, features, max.bins = 32L, max.depth = 5L, type = c("auto", "regression", "classification"), ml.options = ml_options())
    Same options for: ml_gradient_boosted_trees
    e.g. ml_decision_tree(my_table, response = "Species", features = c("Petal_Length", "Petal_Width"))
  ml_generalized_linear_regression(x, response, features, intercept = TRUE, family = gaussian(link = "identity"), iter.max = 100L, ml.options = ml_options())
  ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x), compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
  ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha = (50/k) + 1, beta = 0.1 + 1, ml.options = ml_options())
  ml_linear_regression(x, response, features, intercept = TRUE, alpha = 0, lambda = 0, iter.max = 100L, ml.options = ml_options())
    Same options for: ml_logistic_regression
  ml_multilayer_perceptron(x, response, features, layers, iter.max = 100, seed = sample(.Machine$integer.max, 1), ml.options = ml_options())
  ml_naive_bayes(x, response, features, lambda = 0, ml.options = ml_options())
  ml_one_vs_rest(x, classifier, response, features, ml.options = ml_options())
  ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
  ml_random_forest(x, response, features, max.bins = 32L, max.depth = 5L, num.trees = 20L, type = c("auto", "regression", "classification"), ml.options = ml_options())
  ml_survival_regression(x, response, features, intercept = TRUE, censor = "censor", iter.max = 100L, ml.options = ml_options())
  ml_binary_classification_eval(predicted_tbl_spark, label, score, metric = "areaUnderROC")
  ml_classification_eval(predicted_tbl_spark, label, predicted_lbl, metric = "f1")
  ml_tree_feature_importance(sc, model)

Reading & Writing from Apache Spark
  Import into Spark: read from the file system with spark_read_<fmt>; copy from R with sdf_copy_to or DBI::dbWriteTable; cache or reference Hive tables with tbl_cache or dplyr::tbl.
  Save from Spark: write to the file system with spark_write_<fmt>; download a Spark DataFrame to an R DataFrame with dplyr::collect or sdf_read_column.

Extensions
Create an R package that calls the full Spark API and provides interfaces to Spark packages.
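As a small illustration of chaining the pieces above, the sketch below filters with a dplyr verb and then applies an ML transformer. It assumes the local-mode connection sc and the "spark_iris" table from the earlier example; the threshold value is chosen for illustration only.

```r
library(sparklyr)
library(dplyr)

iris_tbl <- tbl(sc, "spark_iris")       # reference to the existing Spark table

# Wrangle with a dplyr verb, then apply an ML transformer
flagged <- iris_tbl %>%
  filter(Species != "setosa") %>%
  ft_binarizer(input.col  = "Petal_Length",
               output.col = "petal_large",
               threshold  = 4.5)        # illustrative cut-off, not from the sheet

# Summarize in Spark, then collect the small result into R
flagged %>% count(Species, petal_large) %>% collect()
```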
Core Types
  spark_connection() — connection between R and the Spark shell process
  spark_jobj() — instance of a remote Spark object
  spark_dataframe() — instance of a remote Spark DataFrame object

Call Spark from R
  invoke() — call a method on a Java object
  invoke_new() — create a new object by invoking a constructor
  invoke_static() — call a static method on an object

Save from Spark to the File System
Arguments that apply to all functions: x, path
  spark_write_csv(header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
  spark_write_json(mode = NULL)
  spark_write_parquet(mode = NULL)

Import into Spark from a File
Arguments that apply to all functions: sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE
  spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
  spark_read_json()
  spark_read_parquet()

Model predictions
  sdf_predict(object, newdata) — returns a Spark DataFrame with predicted values

Machine Learning Extensions
  ml_create_dummy_variables()
  ml_model()
  ml_prepare_dataframe()
  ml_prepare_response_features_intercept()
  ml_options()
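The invoke functions above can be sketched as follows. This is a minimal example assuming an open local connection; the Java classes shown (java.lang.Math, java.math.BigInteger) are standard JDK classes used purely for demonstration, not sparklyr APIs.

```r
library(sparklyr)
sc <- spark_connect(master = "local")

# Call a static method on a Java class
invoke_static(sc, "java.lang.Math", "max", 5L, 9L)        # returns 9

# Construct a Java object, then call a method on the resulting jobj
billion <- invoke_new(sc, "java.math.BigInteger", "1000000000")
invoke(billion, "longValue")                               # returns 1000000000

spark_disconnect(sc)
```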