2. Agenda
Hadoop vs Spark: the big ‘Big Data’ question
Spark Ecosystem
What is RDD
Operations on RDD: Actions vs Transformations
Running in a cluster
Task schedulers
Spark Streaming
Dataframes API
9. RDD: Resilient Distributed Dataset
Represents an immutable, partitioned collection of elements that can be
operated on in parallel, with built-in failure recovery.
10. Example
HadoopRDD
getPartitions = HDFS blocks
getDependencies = None
compute = load block in memory
getPreferredLocations = HDFS block locations
partitioner = None
MapPartitionsRDD
getPartitions = same as parent
getDependencies = parent RDD
compute = compute parent and apply map()
getPreferredLocations = same as parent
partitioner = None
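The two examples above can be sketched as toy Python classes implementing the same five-method contract (illustrative only, not Spark's actual implementation):

```python
# Toy illustration of the RDD contract from the slide: an RDD is defined by
# its partitions, dependencies, compute(), preferred locations, and partitioner.
class SourceRDD:
    """Stands in for HadoopRDD: partitions come from static input blocks."""
    def __init__(self, blocks):
        self.blocks = blocks                    # pretend these are HDFS blocks

    def get_partitions(self):
        return list(range(len(self.blocks)))

    def get_dependencies(self):
        return []                               # a source RDD has no parent

    def compute(self, partition):
        return list(self.blocks[partition])     # "load block in memory"


class MappedRDD:
    """Stands in for MapPartitionsRDD: same partitions as the parent."""
    def __init__(self, parent, fn):
        self.parent, self.fn = parent, fn

    def get_partitions(self):
        return self.parent.get_partitions()     # same as parent

    def get_dependencies(self):
        return [self.parent]                    # narrow dependency on parent

    def compute(self, partition):
        # compute the parent partition, then apply map()
        return [self.fn(x) for x in self.parent.compute(partition)]


source = SourceRDD([[1, 2], [3, 4]])
mapped = MappedRDD(source, lambda x: x * 10)
print([mapped.compute(p) for p in mapped.get_partitions()])  # [[10, 20], [30, 40]]
```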
14. RDD Operations
● Transformations
○ Apply a user function to every element in a partition
○ Apply an aggregation function to a whole dataset (groupBy, sortBy)
○ Provide functionality for repartitioning (repartition, partitionBy)
● Actions
○ Materialize computation results (collect, count, take)
○ Store RDDs in memory or on disk (cache, persist)
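A key point behind this split: transformations are lazy and only describe work, while actions trigger it. Plain Python generators behave the same way (this is not Spark code):

```python
# Transformations are lazy: they only describe a pipeline. Actions force
# evaluation. Illustrated with Python generators, which defer work the same way.
evaluated = []

def source():
    for x in range(5):
        evaluated.append(x)          # record when an element is really computed
        yield x

# "Transformations": build the pipeline, nothing is computed yet.
pipeline = (x * x for x in source() if x % 2 == 0)
print(evaluated)                     # [] -- no work has happened

# "Action": consuming the pipeline materializes the results.
result = list(pipeline)
print(result)                        # [0, 4, 16]
```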
16. DAG: Directed Acyclic Graph
All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible.
18. DAG Scheduler
The DAG scheduler divides operators into stages of tasks. A stage comprises tasks based on partitions of the input data, and the scheduler pipelines operators together within a stage.
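Pipelining means a chain of narrow operators in one stage runs element by element in a single pass over each partition, with no intermediate collection. A toy sketch of fusing two map operators (not Spark internals):

```python
# Toy sketch of operator pipelining within a stage: two consecutive map
# operators are fused into a single task function, so each element flows
# through both without materializing an intermediate dataset.
def fuse(*ops):
    def task(partition):
        out = []
        for x in partition:
            for op in ops:           # each element passes through the whole chain
                x = op(x)
            out.append(x)
        return out
    return task

stage_task = fuse(lambda x: x + 1, lambda x: x * 2)
print(stage_task([1, 2, 3]))         # [4, 6, 8] -- one pass per partition
```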
20. RDD Persistence: persist() & cache()
When you persist an RDD, each node stores any partitions of it that it computes in memory
and reuses them in other actions on that dataset (or datasets derived from it).
Storage levels: MEMORY_ONLY (default), MEMORY_AND_DISK,
MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY,
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Removing data: partitions are evicted in least-recently-used (LRU) fashion, or removed explicitly with the RDD.unpersist() method.
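The benefit of persisting can be sketched with a toy memoizing wrapper (plain Python, not Spark's API): without it, every action re-runs the whole lineage; with it, the first action stores the result for reuse.

```python
# Sketch of why persist() matters: without caching, every "action" re-runs
# the whole lineage; with caching, the data is computed once and reused.
compute_count = 0

def expensive_lineage():
    global compute_count
    compute_count += 1               # count how often the lineage actually runs
    return [x * x for x in range(4)]

class Persisted:
    def __init__(self, fn):
        self.fn, self.data = fn, None

    def collect(self):               # "action"
        if self.data is None:        # first action computes and stores
            self.data = self.fn()
        return self.data             # later actions reuse the stored result

rdd = Persisted(expensive_lineage)
rdd.collect()
rdd.collect()
print(compute_count)                 # 1 -- the lineage ran only once
```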
22. Task Schedulers
Standalone
○ Default
○ FIFO strategy
○ Controls the number of CPU cores and executor memory
YARN
○ Hadoop oriented
○ Takes all available resources
○ Designed for stateless batch jobs that can be restarted easily if they fail
Mesos
○ Resource oriented
○ Dynamic sharing of CPU cores
○ Less predictable latency
25. Memory usage
• Execution memory
○ Storage for data needed during task execution
○ Shuffle-related data
• Storage memory
○ Cached RDDs
○ Can borrow from execution memory
• User memory
○ User data structures and internal metadata
○ Safeguards against OOM
• Reserved memory
○ Memory needed to run the executor itself
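Assuming the defaults of Spark's unified memory manager (300 MB reserved, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5 — these values are version-dependent, so check the docs for your release), the split of an executor heap can be estimated:

```python
# Rough split of an executor heap under Spark's unified memory manager.
# Defaults are assumed (300 MB reserved, spark.memory.fraction = 0.6,
# spark.memory.storageFraction = 0.5) and vary by Spark version.
def memory_regions(heap_mb, fraction=0.6, storage_fraction=0.5, reserved_mb=300):
    usable = heap_mb - reserved_mb
    unified = usable * fraction                  # execution + storage (can borrow)
    storage = unified * storage_fraction         # soft boundary, not a hard limit
    execution = unified - storage
    user = usable - unified                      # user data structures, metadata
    return {"reserved": reserved_mb, "execution": execution,
            "storage": storage, "user": user}

print(memory_regions(4096))
```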
28. Spark Streaming: Architecture
Spark Streaming receives live input data streams and divides the data into
batches, which are then processed by the Spark engine to generate the final
stream of results in batches.
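A minimal sketch of this micro-batch model (plain Python; the `to_batches` helper is hypothetical, not part of any Spark API):

```python
# Sketch of the micro-batch model: chop a stream of (timestamp, value)
# records into fixed-interval batches, then run a normal batch job on each.
def to_batches(records, interval):
    batches = {}
    for t, value in records:
        batches.setdefault(t // interval, []).append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (5, "e")]
for batch in to_batches(stream, interval=2):
    print(batch)                     # each batch goes to the batch engine
```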
31. Spark Streaming checkpoints
• Create heavy objects in foreachRDD
• Default persistence level of DStreams keeps the data serialized in memory.
• Checkpointing (metadata and received data)
• Automatic restart (task manager)
• Max receiving rate
• Level of Parallelism
• Kryo serialization
34. Apache Hive
• Hadoop product
• Stores metadata in a relational database (the metastore), but data only in HDFS
• Not suited for real-time data processing
• Best used for batch jobs over large datasets of immutable data (e.g., web logs)
Is a good choice if you:
• Want to query the data
• Are familiar with SQL
35. About Spark SQL
Part of Spark core since April 2014
Works with structured data
Mixes SQL queries with Spark programs
Connects to many data sources (files, Hive tables, external databases, RDDs)
Execution in a cluster
Parallelism
Fault tolerance
Speed
Various data formats
Monitoring and resource allocation
Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
Example: we read data from a file and take each employee's name, position, salary, and age. We filter by the positions of interest, and can then aggregate, e.g., the average salary by position and by age.
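That example can be sketched in plain Python (the field names and values are illustrative; a Spark SQL version would use select/filter/groupBy instead):

```python
# Plain-Python version of the example: select fields, filter by position,
# then compute the average salary per position.
from collections import defaultdict

rows = [
    {"name": "Ann", "position": "engineer", "salary": 100, "age": 30},
    {"name": "Bob", "position": "engineer", "salary": 120, "age": 35},
    {"name": "Eve", "position": "manager",  "salary": 150, "age": 40},
]

wanted = {"engineer"}                              # filter by needed positions
filtered = [r for r in rows if r["position"] in wanted]

totals = defaultdict(lambda: [0, 0])               # position -> [sum, count]
for r in filtered:
    totals[r["position"]][0] += r["salary"]
    totals[r["position"]][1] += 1

avg_salary = {p: s / n for p, (s, n) in totals.items()}
print(avg_salary)                                  # {'engineer': 110.0}
```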
window length - The duration of the window (3 in the figure).
sliding interval - The interval at which the window operation is performed (2 in the figure).
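With the figure's numbers (window length 3, sliding interval 2), consecutive windows over a sequence of batches overlap by one batch. A plain-Python sketch:

```python
# Sliding windows over micro-batches: window length 3, sliding interval 2,
# matching the numbers quoted from the figure.
def windows(batches, length, interval):
    out = []
    for end in range(length, len(batches) + 1, interval):
        out.append(batches[end - length:end])    # the last `length` batches
    return out

batches = ["b1", "b2", "b3", "b4", "b5"]
for w in windows(batches, length=3, interval=2):
    print(w)
```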