BENEATH RDD
IN APACHE SPARK
USING SPARK-SHELL AND WEBUI
Jacek Laskowski / @jaceklaskowski / GitHub / Mastering Apache Spark Notes
Jacek Laskowski is an independent consultant
Contact me at jacek@japila.pl or @JacekLaskowski
Delivering Development Services | Consulting | Training
Building and leading development teams
Mostly Apache Spark and Scala these days
Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark
Java Champion
Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl
http://bit.ly/mastering-apache-spark
SPARKCONTEXT
THE LIVING SPACE FOR RDDS
SPARKCONTEXT AND RDDS
An RDD belongs to one and only one Spark context.
You cannot share RDDs between contexts.
SparkContext tracks how many RDDs have been created.
You can see the count in its toString output.
SPARKCONTEXT AND RDDS (2)
RDD
RESILIENT DISTRIBUTED DATASET
CREATING RDD - SC.PARALLELIZE
Use sc.parallelize(col, slices) to distribute a local
collection of elements.
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at
Alternatively, sc.makeRDD(col, slices)
CREATING RDD - SC.RANGE
Use sc.range(start, end, step, slices) to create an
RDD of Long numbers.
scala> val rdd = sc.range(0, 100)
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:
CREATING RDD - SC.TEXTFILE
Use sc.textFile(name, partitions) to create an RDD of
lines from a file.
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFil
CREATING RDD - SC.WHOLETEXTFILES
Use sc.wholeTextFiles(name, partitions) to create
an RDD of (file name, file content) pairs from a
directory.
scala> val rdd = sc.wholeTextFiles("tags")
rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wh
There are many more advanced functions in
SparkContext to create RDDs.
PARTITIONS (AND SLICES)
Did you notice the words slices and partitions as
parameters?
Partitions (aka slices) are the level of parallelism.
We're going to talk about the level of parallelism later.
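As a quick sanity check, you can ask any RDD how many partitions it has. The spark-shell session below is a sketch; the RDD id and console line in the output vary per session.

```scala
// request 4 slices explicitly and verify
scala> val rdd = sc.parallelize(0 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.partitions.size
res0: Int = 4
```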
CREATING RDD - DATAFRAMES
RDDs are so last year :-) Use DataFrames... early and often!
A DataFrame is a higher-level abstraction over RDDs and
semi-structured data.
DataFrames require a SQLContext.
FROM RDDS TO DATAFRAMES
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at
scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [_1: int]
scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]
...AND VICE VERSA
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at
scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]
scala> df.rdd
res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70]
CREATING DATAFRAMES -
SQLCONTEXT.CREATEDATAFRAME
sqlContext.createDataFrame(rowRDD, schema)
CREATING DATAFRAMES - SQLCONTEXT.READ
sqlContext.read is the modern yet experimental way.
sqlContext.read.format(f).load(path), where f
is one of:
jdbc
json
orc
parquet
text
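For example, loading a JSON file could look as follows (a sketch; the path people.json is hypothetical):

```scala
scala> val df = sqlContext.read.format("json").load("people.json")

// dedicated shortcut methods exist for the built-in formats, e.g.
scala> val df2 = sqlContext.read.json("people.json")
```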
EXECUTION ENVIRONMENT
PARTITIONS AND LEVEL OF PARALLELISM
The number of partitions of an RDD is (roughly) the number
of tasks.
Partitions are a hint for sizing jobs.
Tasks are the smallest unit of execution.
Tasks belong to TaskSets.
TaskSets belong to Stages.
Stages belong to Jobs.
Jobs, stages, and tasks are displayed in web UI.
We're going to talk about the web UI later.
PARTITIONS AND LEVEL OF PARALLELISM (CONT'D)
In local[*] mode, the number of partitions equals the
number of cores (the default in spark-shell).
scala> sc.defaultParallelism
res0: Int = 8
scala> sc.master
res1: String = local[*]
Not necessarily true when you use local or local[n] master
URLs.
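You can verify this by starting spark-shell with an explicit master URL (a sketch; run from SPARK_HOME with a local Spark installation):

```shell
$ ./bin/spark-shell --master local[2]

scala> sc.defaultParallelism
res0: Int = 2
```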
LEVEL OF PARALLELISM IN SPARK CLUSTERS
TaskScheduler controls the level of parallelism
DAGScheduler, TaskScheduler, SchedulerBackend work
in tandem
DAGScheduler manages a "DAG" of RDDs (aka RDD
lineage)
SchedulerBackends manage TaskSets
DAGSCHEDULER
TASKSCHEDULER AND SCHEDULERBACKEND
RDD LINEAGE
RDD lineage is a graph of RDD dependencies.
Use toDebugString to know the lineage.
Be careful with the hops (the +- markers in toDebugString) -
they indicate shuffle boundaries.
Why is the RDD lineage important?
This is the R in RDD - resiliency.
But deep lineage costs processing time, doesn't it?
Persist (aka cache) it early and often!
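Persisting is a one-liner; cache is shorthand for persist with the default MEMORY_ONLY storage level. A spark-shell sketch:

```scala
scala> val rdd = sc.textFile("README.md").filter(_.contains("Spark"))

scala> rdd.cache   // same as rdd.persist(StorageLevel.MEMORY_ONLY)

scala> rdd.count   // the first action computes and caches the partitions
scala> rdd.count   // later actions reuse the cached data
```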
RDD LINEAGE - DEMO
What does the following do?
val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
RDD LINEAGE - DEMO (CONT'D)
How many stages are there?
// val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
scala> rdd.toDebugString
res2: String =
(2) ShuffledRDD[3] at groupBy at <console>:24 []
+-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
| MapPartitionsRDD[1] at map at <console>:24 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
Nothing happens yet - processing time-wise.
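Only an action submits a job. Running one on the demo RDD finally triggers both stages (a sketch; the res number varies per session):

```scala
// count is an action - it runs the two stages and returns
// one record per key (0 and 1)
scala> rdd.count
res3: Long = 2
```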
SPARK CLUSTERS
Spark supports the following clusters:
one-JVM local cluster
Spark Standalone
Apache Mesos
Hadoop YARN
You use --master to select the cluster
spark://hostname:port is for Spark Standalone
And you know the local master URLs, don't you?
local, local[n], or local[*]
MANDATORY PROPERTIES OF SPARK APP
Your task: Fill in the gaps below.
Any Spark application must specify application name (aka
appName ) and master URL.
Demo time! => spark-shell is a Spark app, too!
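In a standalone Spark application you set both through SparkConf (a minimal sketch; the application name is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// the two mandatory properties: appName and master URL
val conf = new SparkConf()
  .setAppName("BeneathRDD")
  .setMaster("local[*]")
val sc = new SparkContext(conf)
```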
SPARK STANDALONE CLUSTER
The built-in Spark cluster
Start the standalone Master with sbin/start-master.sh
Use -h to control the host name to bind to.
Start a standalone Worker with sbin/start-slave.sh
Run a single worker per machine (aka node)
http://localhost:8080/ = web UI for the Standalone cluster
Don't confuse it with the web UI of a Spark application
Demo time! => Run the Standalone cluster
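Spinning up a one-machine Standalone cluster takes two commands (a sketch; run from SPARK_HOME, host and port are the defaults):

```shell
# start the Master; its web UI comes up at http://localhost:8080/
$ ./sbin/start-master.sh -h localhost

# start a Worker and register it with the Master
$ ./sbin/start-slave.sh spark://localhost:7077
```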
SPARK-SHELL
SPARK REPL APPLICATION
SPARK-SHELL AND SPARK STANDALONE
You can connect to Spark Standalone using spark-shell
through the --master command-line option.
Demo time! => we've already started the Standalone
cluster.
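The connection itself is just the master URL (a sketch; assumes a Standalone Master listening on the default port):

```shell
$ ./bin/spark-shell --master spark://localhost:7077
```

The application should then appear under Running Applications in the Master's web UI.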
WEBUI
WEB USER INTERFACE FOR SPARK APPLICATION
WEBUI
It is available at http://localhost:4040/
You can disable it using the spark.ui.enabled flag.
All the events are captured by Spark listeners.
You can register your own Spark listener.
Demo time! => webUI in action with different master URLs
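For example, disabling the web UI for a single session (a sketch):

```shell
$ ./bin/spark-shell --conf spark.ui.enabled=false
```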
QUESTIONS?
- Visit Jacek Laskowski's blog
- Follow @jaceklaskowski on twitter
- Use Jacek's projects at GitHub
- Read Mastering Apache Spark notes.

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Up and running with pyspark
Up and running with pysparkUp and running with pyspark
Up and running with pyspark
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Introduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlibIntroduction to Apache Spark and MLlib
Introduction to Apache Spark and MLlib
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 

Andere mochten auch

Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Spark Summit
 
Derrick Miles on Executive Book Summaries
Derrick Miles on Executive Book SummariesDerrick Miles on Executive Book Summaries
Derrick Miles on Executive Book Summaries
TheMilestoneBrand
 

Andere mochten auch (14)

IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2
 
RDD
RDDRDD
RDD
 
Writing your own RDD for fun and profit
Writing your own RDD for fun and profitWriting your own RDD for fun and profit
Writing your own RDD for fun and profit
 
Opening slides to Warsaw Scala FortyFives on Testing tools
Opening slides to Warsaw Scala FortyFives on Testing toolsOpening slides to Warsaw Scala FortyFives on Testing tools
Opening slides to Warsaw Scala FortyFives on Testing tools
 
A Prototype Storage Subsystem based on Phase Change Memory
A Prototype Storage Subsystem based on Phase Change MemoryA Prototype Storage Subsystem based on Phase Change Memory
A Prototype Storage Subsystem based on Phase Change Memory
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Production Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlibProduction Readiness Testing At Salesforce Using Spark MLlib
Production Readiness Testing At Salesforce Using Spark MLlib
 
Derrick Miles on Executive Book Summaries
Derrick Miles on Executive Book SummariesDerrick Miles on Executive Book Summaries
Derrick Miles on Executive Book Summaries
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Visual book summaries
Visual book summariesVisual book summaries
Visual book summaries
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
 
ProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technologyProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technology
 

Ähnlich wie Beneath RDD in Apache Spark by Jacek Laskowski

Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
Databricks
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
DataWorks Summit
 

Ähnlich wie Beneath RDD in Apache Spark by Jacek Laskowski (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William BentonApache Spark for Library Developers with Erik Erlandson and William Benton
Apache Spark for Library Developers with Erik Erlandson and William Benton
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Apache spark: in and out
Apache spark: in and outApache spark: in and out
Apache spark: in and out
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup Talk
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
 

Mehr von Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 

Kürzlich hochgeladen (20)

Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...

Beneath RDD in Apache Spark by Jacek Laskowski

  • 1. BENEATH RDD IN APACHE SPARK USING SPARK-SHELL AND WEBUI / JACEK LASKOWSKI / @JACEKLASKOWSKI / GITHUB / MASTERING APACHE SPARK NOTES
  • 2. Jacek Laskowski is an independent consultant. Contact me at jacek@japila.pl or @JacekLaskowski. Delivering Development Services | Consulting | Training. Building and leading development teams. Mostly Apache Spark and Scala these days. Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark. Java Champion. Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl
  • 6. SPARKCONTEXT AND RDDS An RDD belongs to one and only one Spark context. You cannot share RDDs between contexts. SparkContext tracks how many RDDs were created. You may see it in toString output.
  • 9. CREATING RDD - SC.PARALLELIZE sc.parallelize(col, slices) to distribute a local collection of elements. scala> val rdd = sc.parallelize(0 to 10) rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at Alternatively, sc.makeRDD(col, slices)
  • 10. CREATING RDD - SC.RANGE sc.range(start, end, step, slices) to create an RDD of Long numbers. scala> val rdd = sc.range(0, 100) rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:
  • 11. CREATING RDD - SC.TEXTFILE sc.textFile(name, partitions) to create an RDD of lines from a file. scala> val rdd = sc.textFile("README.md") rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFil
  • 12. CREATING RDD - SC.WHOLETEXTFILES sc.wholeTextFiles(name, partitions) to create an RDD of (file name, file content) pairs from a directory. scala> val rdd = sc.wholeTextFiles("tags") rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wh
  • 13. There are many more advanced functions in SparkContext to create RDDs.
  • 14. PARTITIONS (AND SLICES) Did you notice the words slices and partitions as parameters? Partitions (aka slices) are the level of parallelism. We're going to talk about the level of parallelism later.
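To make the slices parameter concrete, here is a plain-Scala sketch (not Spark's actual source) of how a collection of n elements is split into roughly equal partitions, mirroring the start/end arithmetic Spark's ParallelCollectionRDD uses; the helper name sliceRanges is made up for illustration:

```scala
// A sketch of how n elements are divided into `slices` partitions:
// each partition i covers [i*n/slices, (i+1)*n/slices).
def sliceRanges(n: Int, slices: Int): Seq[Range] =
  (0 until slices).map { i =>
    val start = ((i * n.toLong) / slices).toInt
    val end   = (((i + 1) * n.toLong) / slices).toInt
    start until end
  }

// 11 elements (0 to 10) over 2 slices -> partitions of 5 and 6 elements
sliceRanges(11, 2).map(_.size)
```

Note how the sizes differ by at most one element, so no partition is noticeably larger than the others.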
  • 15. CREATING RDD - DATAFRAMES RDDs are so last year :-) Use DataFrames...early and often! A DataFrame is a higher-level abstraction over RDDs and semi-structured data. DataFrames require a SQLContext.
  • 16. FROM RDDS TO DATAFRAMES scala> val rdd = sc.parallelize(0 to 10) rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at scala> val df = rdd.toDF df: org.apache.spark.sql.DataFrame = [_1: int] scala> val df = rdd.toDF("numbers") df: org.apache.spark.sql.DataFrame = [numbers: int]
  • 17. ...AND VICE VERSA scala> val rdd = sc.parallelize(0 to 10) rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at scala> val df = rdd.toDF("numbers") df: org.apache.spark.sql.DataFrame = [numbers: int] scala> df.rdd res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70]
  • 19. CREATING DATAFRAMES - SQLCONTEXT.READ sqlContext.read is the modern yet experimental way. sqlContext.read.format(f).load(path), where f is: jdbc json orc parquet text
  • 21. PARTITIONS AND LEVEL OF PARALLELISM The number of partitions of a RDD is (roughly) the number of tasks. Partitions are the hint to size jobs. Tasks are the smallest unit of execution. Tasks belong to TaskSets. TaskSets belong to Stages. Stages belong to Jobs. Jobs, stages, and tasks are displayed in web UI. We're going to talk about the web UI later.
  • 22. PARTITIONS AND LEVEL OF PARALLELISM CTD. In local[*] mode, the number of partitions equals the number of cores (the default in spark-shell) scala> sc.defaultParallelism res0: Int = 8 scala> sc.master res1: String = local[*] Not necessarily true when you use local or local[n] master URLs.
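The * in local[*] resolves to the number of cores the JVM reports; a plain-JVM sketch (not Spark's code) of that default:

```scala
// What local[*] boils down to: the core count the JVM runtime reports.
// This is what sc.defaultParallelism reflects in local[*] mode.
val cores: Int = Runtime.getRuntime.availableProcessors()
assert(cores >= 1)  // every JVM reports at least one core
```

On an 8-core laptop this yields 8, matching the res0 value in the spark-shell transcript above.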
  • 23. LEVEL OF PARALLELISM IN SPARK CLUSTERS TaskScheduler controls the level of parallelism DAGScheduler, TaskScheduler, SchedulerBackend work in tandem DAGScheduler manages a "DAG" of RDDs (aka RDD lineage) SchedulerBackends manage TaskSets
  • 26. RDD LINEAGE RDD lineage is a graph of RDD dependencies. Use toDebugString to know the lineage. Be careful with the hops - they introduce shuffle barriers. Why is the RDD lineage important? This is the R in RDD - resiliency. But deep lineage costs processing time, doesn't it? Persist (aka cache) it early and often!
  • 27. RDD LINEAGE - DEMO What does the following do? val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
  • 28. RDD LINEAGE - DEMO CTD. How many stages are there? // val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1) scala> rdd.toDebugString res2: String = (2) ShuffledRDD[3] at groupBy at <console>:24 [] +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 [] | MapPartitionsRDD[1] at map at <console>:24 [] | ParallelCollectionRDD[0] at parallelize at <console>:24 [] Nothing happens yet - processing time-wise.
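To see what data the demo pipeline would produce, the same map-then-groupBy can be run on plain Scala collections, no cluster required (the unwrapping step at the end is added here for readability and is not part of the Spark snippet):

```scala
// Plain-Scala equivalent of the Spark demo pipeline:
// tag each number with its parity, then group by the tag.
val grouped: Map[Int, Seq[Int]] =
  (0 to 10)
    .map(n => (n % 2, n))                            // (parity, n) pairs
    .groupBy(_._1)                                   // Map(parity -> pairs)
    .map { case (k, pairs) => k -> pairs.map(_._2) } // keep just the numbers

// grouped(0) holds the evens, grouped(1) the odds
```

Unlike the eager Scala version, the Spark pipeline stays lazy: the lineage above records the recipe, and nothing is computed until an action runs.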
  • 29. SPARK CLUSTERS Spark supports the following clusters: one-JVM local cluster Spark Standalone Apache Mesos Hadoop YARN You use --master to select the cluster spark://hostname:port is for Spark Standalone And you know the local master URLs, don't you? local, local[n], or local[*]
  • 30. MANDATORY PROPERTIES OF SPARK APP Your task: Fill in the gaps below. Any Spark application must specify application name (aka appName ) and master URL. Demo time! => spark-shell is a Spark app, too!
  • 31. SPARK STANDALONE CLUSTER The built-in Spark cluster Start standalone Master with sbin/start-master.sh Use -h to control the host name to bind to. Start standalone Worker with sbin/start-slave.sh Run single worker per machine (aka node) http://localhost:8080/ = web UI for Standalone cluster Don't confuse it with the web UI of Spark application Demo time! => Run Standalone cluster
  • 33. SPARK-SHELL AND SPARK STANDALONE You can connect to Spark Standalone using spark-shell through --master command-line option. Demo time! => we've already started the Standalone cluster.
  • 34. WEBUI WEB USER INTERFACE FOR SPARK APPLICATION
  • 35. WEBUI It is available under http://localhost:4040/ You can disable it using spark.ui.enabled flag. All the events are captured by Spark listeners You can register your own Spark listener. Demo time! => webUI in action with different master URLs
  • 36. QUESTIONS? - Visit Jacek Laskowski's blog - Follow @jaceklaskowski at twitter - Use Jacek's projects at GitHub - Read Mastering Apache Spark notes.