Dive into Spark 2
We will talk about
• Scala
• Spark intro
• RDD in depth
• Spark architecture
• Shuffle and how it works
• Dive into the Dataset API
• Spark streaming
Spark
•Spark is a cluster computing engine.
•Provides high-level API in Scala, Java, Python and R.
•Provides high level tools:
•Spark SQL.
•MLlib.
•GraphX.
•Spark Streaming.
RDD
•The basic abstraction in Spark is the RDD.
•Stands for: Resilient Distributed Dataset.
•It is a collection of items whose source may be, for example:
•Hadoop (HDFS).
•JDBC.
•ElasticSearch.
•And more…
D is for Partitioned
• Partition is a sub-collection of data that should fit into memory
• Partition + transformation = Task
• This is the distributed part of the RDD
• Partitions are recomputed in case of failure - Resilient
(Diagram: the lines of a text file split into partitions, e.g. lines 1 to 100, 101 to 200 and 201 to 300, each partition handled by its own task.)
D is for dependency
• RDD can depend on other RDDs.
• RDD is lazy.
• For example rdd.map(String::toUpperCase)
• Creates a new RDD which depends on the original one.
• Contains only meta-data (i.e., the computing function).
• Only when a specific command (known as an action, like collect) is
invoked will the flow be computed.
The famous word count example
sc.textFile("src/main/resources/books.txt")
.flatMap(line => line.split(" "))
.map(w => (w,1))
.reduceByKey((c1, c2) => c1 + c2)
RDD from Collection
•You can create an RDD from a collection:
sc.parallelize(list)
•Takes a sequence from the driver and distributes it across the
nodes.
•Note, the distribution is lazy so be careful with mutable
collections!
•Important: if it’s a range, use sc.range, as it does not create the
collection on the driver (see the sketch below).
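A minimal sketch of the two approaches, assuming a local SparkSession (the app name and master are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallelize-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Distributes an in-memory sequence from the driver (lazily) across 3 partitions
val listRdd = sc.parallelize(Seq("a", "b", "c"), numSlices = 3)

// For numeric ranges, sc.range generates the values on the executors
// instead of materializing the whole collection on the driver
val rangeRdd = sc.range(0L, 1000000L, step = 1, numSlices = 8)

println(listRdd.count())   // 3
println(rangeRdd.count())  // 1000000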
RDD from file
•Spark supports reading files, directories and compressed files.
•The following methods are available out of the box:
•textFile – retrieving an RDD[String] (lines).
•wholeTextFiles – retrieving an RDD[(String, String)] with
filename and content.
•sequenceFile – Hadoop sequence files RDD[(K,V)].
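For example (a sketch; the paths are placeholders, and sc is the SparkContext from the earlier snippet):

// One element per line; the path may be a file, a directory or a glob,
// and compressed files (e.g. .gz) are read transparently
val lines = sc.textFile("/data/books")

// One element per file: (fileName, fileContent)
val files = sc.wholeTextFiles("/data/books")

println(lines.count())
println(files.keys.first())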
RDD Actions
•Return values by evaluating the RDD (not lazy):
•collect() – returns a list containing all the elements of the
RDD. This is the main method that evaluates the RDD.
•count() – returns the number of the elements in the RDD.
•first() – returns the first element of the RDD.
•foreach(f) – performs the function on each element of the
RDD.
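A small sketch of these actions (assuming sc is an existing SparkContext, as in the earlier snippets):

val rdd = sc.parallelize(Seq(3, 1, 2))

val all: Array[Int] = rdd.collect()   // brings every element back to the driver
val n: Long = rdd.count()             // 3
val head: Int = rdd.first()           // 3 (first element of the first partition)
rdd.foreach(x => println(x))          // runs on the executors, not on the driver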
RDD Transformations
•Return a pointer to a new RDD carrying the transformation meta-data.
•map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•flatMap(func) - Similar to map, but each input item can be mapped
to 0 or more output items (so func should return a Seq rather than a
single item).
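A sketch showing that transformations only build metadata until an action runs (sc as in the earlier snippets):

val lines = sc.parallelize(Seq("Hello Spark", "hello world"))

val upper    = lines.map(_.toUpperCase)        // one output element per input element
val nonEmpty = upper.filter(_.nonEmpty)        // keeps elements where the predicate is true
val words    = nonEmpty.flatMap(_.split(" "))  // 0..n output elements per input element

println(words.count())  // 4 -- nothing is computed until this action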
Per-partition Operations
•Spark provides several APIs for per-partition operations:
•mapPartitions.
•mapPartitionsWithIndex.
•You need to provide a function of type:
•Iterator[A] => Iterator[B].
•Example use-case: sort by values.
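A sketch of mapPartitionsWithIndex (sc as above); the function receives each partition as an Iterator, so any per-partition setup is done once per partition:

val rdd = sc.parallelize(1 to 10, numSlices = 3)

val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
  // iter is the whole partition; we must return an Iterator as well
  iter.map(value => s"partition $partitionIndex -> $value")
}

tagged.collect().foreach(println)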
Caching
•One of the strongest features of Spark is the ability to cache an
RDD.
•Spark can cache the items of the RDD in memory or on disk.
•You can avoid expensive re-calculations this way.
•cache() and the no-argument persist() store the content in memory.
•You can provide a different storage level by supplying it to the
persist method.
Caching
•One of the strongest features of Spark is the ability to cache an
RDD.
•You can store the RDD in memory, on disk, or try memory first and
fall back to disk if it does not fit.
•In the memory it can be stored either deserialized or serialized.
•Note that serialized is more space-efficient but CPU works harder.
•You can also specify that the cache will be replicated, resulting in a
faster cache in case of partition loss.
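A sketch of choosing a storage level (the path is a placeholder); note that a storage level cannot be changed once it has been set on an RDD:

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("/data/books.txt").flatMap(_.split(" "))

// MEMORY_AND_DISK spills to disk when memory is full; other options include
// MEMORY_ONLY_SER (less memory, more CPU) and MEMORY_ONLY_2 (replicated)
val cached = words.persist(StorageLevel.MEMORY_AND_DISK)

cached.count()      // the first action materializes the cache
cached.count()      // served from the cache
cached.unpersist()  // drop it when it is no longer needed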
In-Memory Caching
•When using memory as a cache storage, you have the option of
either using serialized or de-serialized state.
•Serialized state occupies less memory at the expense of more CPU.
•However, even in de-serialized mode, Spark still needs to figure out
the size of each RDD partition.
•Spark uses the SizeEstimator utility for this purpose.
•This can sometimes be quite expensive (for complicated data structures).
Unpersist
•Caching is based on LRU.
•If you don’t need a cache, remove it with the unpersist()
method.
Map Reduce
• Map-reduce was introduced by Google. It is an interface that can
break a task into sub-tasks, distribute them to be executed in
parallel (map), and aggregate the results (reduce).
• Between the Map and Reduce parts, an intermediate phase
called ‘shuffle’ can be introduced.
• In the Shuffle phase, the output of the Map operations is sorted
and aggregated for the Reduce part. (Hadoop)
Shuffle
•Shuffle operations repartition the data across the network.
•Can be very expensive operations in Spark.
•You must be aware of where and why shuffle happens.
•Order is not guaranteed inside a partition after a shuffle.
•Popular operations that cause shuffle are: groupBy*,
reduceBy*, sort*, aggregateBy* and join/intersect operations on
multiple RDDs.
Everyday I’m shuffling
Image taken from https://www.datastax.com/wp-content/uploads/2015/05/SparkShuffle.png
What actually happens in Shuffle?
•Shuffle write: the shuffle map tasks write their output, through an in-memory buffer, into block files on local disk.
•Shuffle read: waits for all the shuffle map tasks to finish, then fetches the relevant blocks.
•Fetched records are combined in an in-memory AppendOnlyMap; if there is no space, Spark spills to a disk-backed ExternalAppendOnlyMap.
Transformations that shuffle
*Taken from the official Apache Spark documentation
•distinct([numTasks]) - Return a new dataset that contains the
distinct elements of the source dataset.
•groupByKey([numTasks]) - When called on a dataset of (K,
V) pairs, returns a dataset of (K, Iterable<V>) pairs.
•reduceByKey(func, [numTasks]) -When called on a
dataset of (K, V) pairs, returns a dataset of (K, V) pairs where
the values for each key are aggregated using the given reduce
function func, which must be of type (V,V) => V.
Transformations that shuffle
*Taken from the official Apache Spark documentation
•join(otherDataset, [numTasks]) - When called on datasets of
type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs
with all pairs of elements for each key. Outer joins are
supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.
•sort, sortByKey…
•More at http://spark.apache.org/docs/latest/programming-guide.html
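A sketch of two shuffling transformations on pair RDDs (the data and variable names are made up; sc is a SparkContext as in the earlier snippets):

val purchases = sc.parallelize(Seq(("alice", 10.0), ("bob", 5.0), ("alice", 7.5)))
val countries = sc.parallelize(Seq(("alice", "IL"), ("bob", "US")))

// reduceByKey combines values on the map side before shuffling
val totals = purchases.reduceByKey((a, b) => a + b, 4)  // 4 output partitions

// join shuffles both RDDs so that matching keys land in the same partition
val joined = totals.join(countries)  // RDD[(String, (Double, String))]

joined.collect().foreach(println)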
How does it work?
•Cluster manager:
•Responsible for resource allocation
•Cluster management
•Orchestrates work
•Worker nodes:
•Track and spawn application processes
Image taken from https://spark.apache.org/docs/latest/cluster-overview.html
How does it work?
•Driver:
•Executes the main program
•Creates the RDDs
•Collects the results
•Executors:
•Execute the RDD operations
•Participate in the shuffle
Image taken from https://spark.apache.org/docs/latest/cluster-overview.html
Spark Terminology
•Application – the main program that runs on the driver and
creates the RDDs and collects the results.
•Job – a sequence of transformations on an RDD until an action occurs.
•Stage – a sequence of transformations on an RDD until a shuffle
occurs.
•Task – a sequence of transformations on a single partition until a
shuffle occurs.
Spark SQL
•Spark module for structured data
•Spark SQL provides a unified engine (catalyst) with 3 APIs:
•SQL (out of scope today).
•Dataframes (untyped).
•Datasets (typed) – Only Scala and Java for Spark 2.x
DataFrame and DataSets
• Originally Spark provided only DataFrames.
• A DataFrame is conceptually a table with typed columns.
• However, it is not typed at compile time.
• Starting with Spark 1.6, Datasets were introduced.
• They represent a table with columns, but the rows are typed at
compile time.
Spark Session
•The entry-point to the Spark SQL.
•It doesn’t require an existing SparkContext (one is managed internally).
•Supports binding to the current thread (Inherited ThreadLocal)
for use with getOrCreate().
•Supports configuration settings (SparkConf, app name and
master).
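A minimal sketch of building a session (the app name, master and config value are placeholders):

import org.apache.spark.sql.SparkSession

// getOrCreate() returns an existing session if one is already active,
// otherwise it builds a new one from this configuration
val spark = SparkSession.builder()
  .appName("dive-into-spark2")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()

// The SparkContext is managed inside the session
val sc = spark.sparkContext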
Creating DataFrames
• DataFrames are created using the Spark session
• df = spark.read.json("/data/people.json")
• df = spark.read.csv("/data/people.csv")
• df = spark.read.x..
• Options for constructing the DataFrame
• spark.read.option("key","value").option…csv(..)
• To create a DataFrame or Dataset from an existing RDD:
• import spark.implicits._
rdd.toDF() // or rdd.toDS() for a typed Dataset
Schema
•In many cases Spark can infer the schema of the data by itself
(using reflection or sampling).
•Use printSchema() to explore the schema that Spark deduced.
•Spark also lets you provide the schema yourself,
•and to cast types to change an existing schema (see the sketch below):
df.withColumn("c1", df("c1").cast(IntegerType))
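A sketch of providing a schema up front and casting a column afterwards (assuming spark is an existing SparkSession; the path and column names are placeholders):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Explicit schema instead of letting Spark infer it
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  StringType, nullable = true)  // arrives as a string in this file
))

val df = spark.read.schema(schema).json("/data/people.json")

// Cast an existing column to change the schema afterwards
val typed = df.withColumn("age", df("age").cast(IntegerType))
typed.printSchema()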
Exploring the DataFrame
•Filtering by columns:
•df.filter("column > 2")
•df.filter(df.column > 2) (Python API)
•df.filter(df("column") > 2) (Scala API)
•Selecting specific fields:
•df.select(df("column1"), df("column2"))
Exploring the DataFrame
•Aggregations (see the sketch below):
•df.groupBy(..).sum/count/avg/max/min/.. – aggregation per
group
•df.agg(min(df("col"))) – computes a single aggregated value
•Aliasing:
•df.alias("table1")
•df.withColumnRenamed("old1","new1")
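A sketch combining these calls (assuming df is an existing DataFrame with hypothetical department and salary columns):

import org.apache.spark.sql.functions.{avg, min}

val perDept    = df.groupBy("department").agg(avg("salary"), min("salary"))  // per group
val overallMin = df.agg(min(df("salary")))                                   // single value
val renamed    = df.withColumnRenamed("salary", "gross_salary")              // rename a column

perDept.show()
overallMin.show()
renamed.printSchema()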
SQL on DataFrame
•Register the DF as a SQL table:
•df.createOrReplaceTempView("table1")
•Remember the Spark session? SessionBuilder..
•Now we can run SQL queries:
•spark.sql("select count(*) from table1")
Query planning
•All of the Spark SQL APIs (SQL, Datasets and DataFrames)
result in a Catalyst execution.
•Query execution is composed of 4 stages:
•Logical plan.
•Optimized logical plan.
•Physical plan.
•Code-gen (Tungsten).
Logical plan
•In the first stage, the Catalyst creates a logical execution plan.
•A tree composed of query primitives and combinators that
match the requested query.
Optimized logical plan
•Catalyst runs a set of rules for optimizing the logical plan.
•E.g.:
•Merge projections.
•Push down filters.
•Convert outer-joins to inner-joins.
Tungsten
•Spark 2 introduced a new optimization: whole-stage code
generation.
•It turns out that in many scenarios the bottleneck is the CPU,
•wasted on virtual calls and CPU cache misses.
•Simple, query-specific code should be favored.
Tungsten
•Spark tries to generate specific Java code for each stage of
the query,
•using the extremely fast Janino Java compiler.
•Remember that we trade a per-row performance hit for a single
per-query compilation cost (even if that one-time cost is greater).
Micro batching with Spark Streaming
• Takes a partitioned stream of data
• Slices it up by time – usually seconds
• DStream – composed of RDD slices, each containing a collection of
items
Taken from official spark documentation
Spark streaming application
•Define the time window when creating the StreamingContext.
•Define the input sources for the streaming pipeline.
•Combine transformations.
•Define the output operations.
•Invoke start().
•Wait for the computation to stop (awaitTermination()).
Important
•Once start() has been invoked, no additional operations can be
added.
•A stopped context can’t be restarted.
Example
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(checkpoint.toString())
val dstream: DStream[Int] = ssc.textFileStream(s"file://$folder/").map(_.trim.toInt)
dstream.print()
ssc.start()
ssc.awaitTermination()
DStream operations
• Similar to RDD operations, with small changes:
•map(func) - returns a new DStream by applying func to every
element of the original stream.
•filter(func) - returns a new DStream formed by selecting those
elements of the source stream on which func returns true.
•reduce(func) – returns a new DStream of single-element
RDDs by applying the reduce func to every source RDD.
Windows
• DStream also supports window operations.
• You can define a sliding window that contains n number of
intervals.
• You’ll get a new DStream that represents the window.
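A self-contained sketch of a windowed word count (the input folder is a placeholder); the window length and slide interval must be multiples of the batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val wordCounts = ssc.textFileStream("/data/incoming")
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// A 60-second window recomputed every 20 seconds
val windowed = wordCounts.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(20))
windowed.print()

ssc.start()
ssc.awaitTermination()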
Using your existing batch logic
• transform(func) - an operation that creates a new DStream by
applying func to the DStream's RDDs.
dstream.transform(existingBusinessFunction)
Updating the state
• By default, there is nothing shared between window intervals.
• How do I accumulate results with the current batch?
• updateStateByKey(updateFunc) – a transformation that
creates a new key-value DStream where each value is updated
from the previous state and the new values.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  runningCount.map(_ + newValues.sum).orElse(Some(newValues.sum))
}
Checkpoints
• Checkpoints – periodically save to reliable storage
(HDFS/S3/…) the data necessary to recover from failures
• Metadata checkpoints
• Configuration of the stream context
• DStream definition and operations
• Incomplete batches
• Data checkpoints
• saving stateful RDD data
Checkpoints
• To configure checkpoint usage :
• streamingContext.checkpoint(directory)
• To create a recoverable streaming application:
• StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)
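A sketch of a recoverable context (the checkpoint directory is a placeholder; use HDFS/S3 in production):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/data/checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("recoverable-app").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // define sources, transformations and output operations here, before returning
  ssc
}

// On a clean start createContext() runs; after a failure the context is
// rebuilt from the checkpoint data and createContext() is skipped
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()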
Working with the foreach RDD
• A common practice is to use the foreachRDD(func) to push
data to an external system.
• Don’t do:
dstream.foreachRDD { rdd =>
val myExternalResource = ... // Created on the driver
rdd.foreachPartition { partition =>
myExternalResource.save(partition)
}
}
Working with the foreach RDD
• Instead do:
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partition =>
val myExternalResource = ... // Created on the executor
myExternalResource.save(partition)
}
}
Distributed streaming challenges
• Consistency – eventual consistency is not always tolerable
• Fault tolerance – handling errors when writing to external
storage is left to the user
• Out-of-order data
Structured streaming
• Guarantee: at any time, the output of the application is
equivalent to executing a batch job on a prefix of the data.
• Treat all data as an unbounded input table
Output modes
• Append: only rows added since the last trigger
• Complete: the whole result table
• Update: only the rows that were updated since the last trigger
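A sketch of a streaming aggregation written in complete mode (the socket host and port are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("output-modes").master("local[*]").getOrCreate()
import spark.implicits._

val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", "9999").load()

val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")   // "append" and "update" are the other modes
  .format("console")
  .start()

query.awaitTermination()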
Structured streaming
val inputDF = spark.readStream.json("hdfs://logs")
inputDF.groupBy($"log_level", window($"time", "1 hour")).count()
  .writeStream.format("jdbc").save("jdbc:mysql//…")
Requirements
• Replayable input source - Kafka/Kinesis/Files
• Transactional updates – To avoid inconsistency
Questions?