Big data distributed processing
Spark introduction
INDEX
1. History of Big Data
2. Apache Spark: basic concepts
3. Spark SQL
4. Spark deployment
1. History of Big Data
Definitions of Big Data:
1. Lots of data! (estimated 44 zettabytes of information in 2020). @me
2. The term that describes a huge amount of data (structured and unstructured)
that floods businesses on a daily basis. @SAS
3. 3, 4, 5, 7, 10 Big Data V’s. @ORACLE @IBM
Volume, Velocity, Variety, Veracity, Value, Validity, Variability, Venue, …
History of Big Data
The most accurate definition:
Big Data refers to the systems and technologies needed to obtain, process,
analyze, visualize or extract value from data in ways that cannot be achieved with
previous technologies or systems due to its high volume, traffic and volatility.
History of Big Data
Google and the release of its technologies:
2003: The Google File System
● Problems with storage start to arise.
● We have designed and implemented the Google File System (GFS) to meet
the rapidly growing demands of Google’s data processing needs…
2004: MapReduce: Simplified Data Processing on Large Clusters
● Problems with data processing start to arise.
● Over the past five years, the authors and many others at Google have
implemented hundreds of special-purpose computations that process large
amounts of raw data…
History of Big Data
(2003) The Google File System
Definition of three (four) problems to solve:
1. Server hardware.
2. File sizes.
3. File usage.
4. Flexible applications.
History of Big Data
(2003) The Google File System
1. Server hardware:
History of Big Data
Problem → Solution
● Hardware and software problems, component failures, human errors → Fault tolerance: store multiple copies of the same file on different servers.
● High-performance server costs → Horizontal scalability (1,000+ servers) with commodity servers (cheap and small).
(2003) The Google File System
2. File sizes:
History of Big Data
Problem → Solution
● Multi-GB files (cost of 1 GB ≈ $10 in the year 2000) → Optimize reading at the expense of writing.
● Block size of a few KB (billions of reads per file) → Increase the block size to 64-128 MB to reduce the number of reads per file.
(2003) The Google File System
3. File usage:
History of Big Data
Problem → Solution
● Files are only appended (historic files, audit files, intermediate results, ...) and are read sequentially → Immutable, incremental files: read-only and append-only.
(2004) MapReduce: Simplified Data Processing on Large Clusters
Distributed computing paradigm:
● Distributed computing using distributed file system.
● Functional computing is parallelizable by design.
● The system must provide:
○ Data movement administration.
○ Distributed execution management.
○ Fault tolerance.
● Clusters of commodity machines processing TBs of data.
History of Big Data
(2004) MapReduce: Simplified Data Processing on Large Clusters
Definition of two simple operations:
● Map: a given function is applied to each key-value pair to generate
intermediate key-value pairs.
● Reduce: a combine function groups by key and calculates the aggregation of
the multiple values associated with each key (a small sketch follows below).
History of Big Data
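To make the paradigm concrete, here is a tiny word-count sketch in plain Scala (my own illustration, not code from the paper; the documents are made up):

// Map: emit an intermediate (word, 1) pair for every word of every document.
val documents = Seq("big data spark", "big data hadoop")
val intermediate = documents.flatMap(_.split(" ").map(word => (word, 1)))
// Shuffle + Reduce: group the pairs by key and aggregate the values of each key.
val counts = intermediate.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// counts contains: big -> 2, data -> 2, spark -> 1, hadoop -> 1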
(2004) MapReduce: Simplified Data Processing on Large Clusters
History of Big Data
Hadoop Distributed File System (HDFS)
● ~2003: started with the Nutch project: a search engine and crawler.
● They had problems processing data with dozens of servers.
● Google papers gave them a new approach.
● 2006: Yahoo! became interested in the project and took the distributed
computing part of the software, renaming it Hadoop.
History of Big Data
Hadoop Distributed File System (HDFS)
History of Big Data
2. Apache Spark:
Basic concepts
● Framework for distributed data computing.
● Designed to be executed in large scale clusters with lots of data!
● Runs faster than MapReduce (in-memory processing).
● More functions than just Map and Reduce.
● Multiple APIs, multiple programming languages:
○ Core, SQL, Streaming, GraphX, ML, MLlib, Structured Streaming, …
○ Scala (native), Java, Python, R.
● Runs everywhere:
○ Standalone, YARN, Mesos, Kubernetes, AWS, ...
Apache Spark: basic concepts
● Fault tolerance (RDD).
● Easier resource management.
● Reusable data: caching.
● Code control and analysis (DAG).
● Generic programming patterns: the same code can run in local mode or on
hundreds of executors.
● Lazy evaluation: transformations and actions.
Apache Spark: basic concepts
Installation
● GOTO: https://spark.apache.org/downloads.html
● Select version 2.3.2, for Hadoop 2.7 or later.
● Extract it, set SPARK_HOME and add it to PATH.
○ tar xvf spark-2.3.2-bin-hadoop2.7.tgz
○ Add the following lines in .bashrc:
■ export SPARK_HOME=/your/spark/folder
■ export PATH=$PATH:$SPARK_HOME/bin
Apache Spark: basic concepts
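To verify the installation (a quick check not in the original slides), start the bundled spark-shell; it creates the spark SparkSession for you, so a one-liner is enough:

// Inside spark-shell, `spark` is already available.
spark.range(10).count()   // should return 10 if everything is wired up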
Apache Spark: basic concepts
RDD
● The basic abstraction: RDD (Resilient Distributed Datasets)
RDD[Integer] containing [1, 2, 3, 4, 5], split into Partition0 = [1, 3], Partition1 = [2, 5] and Partition2 = [4].
Apache Spark: basic concepts
RDD
● Distributed: split into multiple partitions.
● Resilient: fault tolerance. Data can be reprocessed or duplicated in case of
failure.
● Immutable typed elements: data cannot be modified.
● Data abstraction that allows the data to be processed natively in parallel across
different executors.
● Likely to be deprecated soon.
Apache Spark: basic concepts
RDD
val numbers = Seq(1, 2, 3, 4, 5)
val numbersRDD = spark.sparkContext.parallelize(numbers)
numbersRDD: org.apache.spark.rdd.RDD[Int]
Apache Spark: basic concepts
Dataframe
● Distributed collection of Row objects.
● Data is organized into named columns.
Source data (name, age): hektor 26, pepe 22, jose 40, represented as an RDD[Row] with Partition0 = {hektor 26, pepe 22} and Partition1 = {jose 40}.
Apache Spark: basic concepts
Dataframe
● Catalyst: powers the Dataframe and SQL APIs.
1. Analyzing a logical plan to resolve references
2. Logical plan optimization
3. Physical planning
4. Code generation to compile parts of the query to Java bytecode.
● Tungsten: provides a physical execution backend which explicitly manages
memory and dynamically generates bytecode for expression evaluation.
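To see Catalyst and Tungsten at work you can ask any Dataframe for its plans; a small sketch, assuming a hypothetical Dataframe df with name and age columns and spark.implicits._ in scope:

// Prints the parsed, analyzed and optimized logical plans plus the physical plan.
df.filter($"age" > 21).select("name").explain(true)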
Apache Spark: basic concepts
Dataframe
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val rowNumbersRDD = numbersRDD.map(v => Row(v))
val schema = StructType(Seq(
StructField("id", IntegerType, false)
))
val numbersDF = spark.createDataFrame(rowNumbersRDD, schema)
Apache Spark: basic concepts
Dataset
● Extension of the DataFrame API: combines the strong typing and lambda
functions of the RDD API with the optimized execution engine of the DataFrame API.
● The most used and most powerful API.
● Encoders: allow working with structured and unstructured data.
● Encoders provide type-safe, object-oriented programming.
Apache Spark: basic concepts
Dataset
import spark.implicits._ // needed for .toDS
val numbersDS = Seq(1, 2, 3).toDS
[...]
case class Person(name: String, age: Integer)
val personList = Seq(Person("Hektor", 26), Person("Pepe", 22))
val personDS = personList.toDS
// or, equivalently:
val personDS = spark.createDataset(personList)
[...]
Apache Spark: basic concepts
Spark operation types
● Transformations:
○ Operations that create a new RDD, usually based on a previous one.
○ They do not evaluate the expression until an action is called.
○ Spark is able to infer the output type.
○ You can chain multiple transformations before an action (see the sketch below).
● Actions:
○ Operations that evaluate all the transformations defined.
○ They force the evaluation in order to save or use the resulting data.
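A minimal sketch of this laziness, reusing the numbersRDD defined in the following slides:

// Transformations only build up the DAG; no job runs yet.
val doubled = numbersRDD.map(_ * 2)
val evens = doubled.filter(_ % 4 == 0)
// The action triggers the evaluation of the whole chain.
val result = evens.collect()   // Array(4, 8) for the input 1..5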
Apache Spark: basic concepts
Transformation types
Apache Spark: basic concepts
● Map map[U] (f: (T)=>U)
Applies the given function to all single elements. Can modify the values or
return a different type.
val myF = (v:Int) => v + 1
val newRDD : RDD[Int] = numbersRDD.map(myF)
val strRDD : RDD[String] = numbersRDD.map(v => v.toString)
val toLong = numbersRDD.map(_.toLong)
Narrow transformations
(e.g. 3 → “3”)
Apache Spark: basic concepts
● flatMap flatMap[U](f: (T) => TraversableOnce[U])
Applies the given function to all single elements and then flattens. Can
modify the values, return a different type, return multiple values or none.
val myF = (v:Int) => Seq(v + 1)
val newRDD = numbersRDD.flatMap(myF)
val strRDD = numbersRDD.flatMap(v => Seq(v.toString))
val toLong = numbersRDD.flatMap(_.toLong::Nil)
Narrow transformations
Apache Spark: basic concepts
● flatMap flatMap[U](f: (T) => TraversableOnce[U])
numbersRDD.flatMap(v=> 1 to v)
Narrow transformations
Each input element 1..5 produces the range [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5], and the results are flattened into a single RDD: 1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5.
Apache Spark: basic concepts
● mapPartitions mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U])
Similar to map, but applying the function to the whole partition.
numbersRDD.mapPartitions(v => v.map(_ + 1))
● filter filter(f: (T) ⇒ Boolean)
Obtains a new RDD with the elements that satisfy the predicate.
val myF = (v:Int) => v%2==0
val evenNumbers = numbersRDD.filter(myF)
Narrow transformations
Apache Spark: basic concepts
● groupBy groupBy[(K, Iterable[T])](f: (T) ⇒ K, p: Partitioner)
Obtains a new RDD grouped by key.
val myF = (v:Int) => v % 2
val groupedRDD = numbersRDD.groupBy(myF)
Wide transformations
Elements 1..5 grouped by v % 2 yield (0, [2, 4]) and (1, [1, 3, 5]).
Apache Spark: basic concepts
● repartition(n)
Redistributes the RDD into n partitions of roughly equal size (always a shuffle).
● coalesce(n, shuffle : Boolean = false)
Redistributes the RDD into n partitions. If shuffle is false, you can only reduce
the number of partitions and the transformation will be narrow.
Wide transformations
coalesce(2, false): partitions [1,2], [3,4], [5,6] become [1,2,3,4] and [5,6] (no shuffle).
Apache Spark: basic concepts
● coalesce(n, shuffle : Boolean = false)
Wide transformations
coalesce(2, true) or repartition(2) (shuffle = wide): partitions [1,2], [3,4], [5,6] are shuffled into [1,2,3] and [4,5,6]. RDD[Int] with 3 partitions → RDD[Int] with 2 partitions.
Apache Spark: basic concepts
The RDD is automatically enriched with PairRDD functions when its elements are (K, V) pairs.
New transformations become available:
val strings = "tomato apple apple pear tomato"
val stringsRDD = spark.sparkContext.parallelize(strings.split(" "))
val pairs = stringsRDD.map(v => (v,1))
val countPairs = pairs.reduceByKey(_ + _)
● mapValues, flatMapValues, sortByKey, countByKey, foldByKey, ...
Key-value transformations
Apache Spark: basic concepts
● foldByKey foldByKey(zero: V)(f: (V, V) ⇒ V): RDD[(K, V)]
Applies the function to the values of each key, using the zero value as the initial value.
val sumFunc = (v1:Int, v2:Int) => v1 + v2
val countPairs = pairs.foldByKey(0)(_ + _) // or sumFunc
● aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒
U)
Like foldByKey, but combOp can be a different operation.
* These functions have their RDD counterparts (without ByKey)
Key-value transformations
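As a sketch of aggregateByKey, computing the average value per key over a hypothetical (key, value) RDD; the data and variable names are invented for the example:

val kv = spark.sparkContext.parallelize(Seq(("a", 3), ("a", 5), ("b", 2)))
// seqOp merges a value into the (sum, count) accumulator inside a partition,
// combOp merges two accumulators coming from different partitions.
val sumCount = kv.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)
val avgPerKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
// avgPerKey.collect() contains (a, 4.0) and (b, 2.0)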
Apache Spark: basic concepts
foldByKey and aggregateByKey
Apache Spark: basic concepts
An action executes the current operation and all the previous transformations defined in the DAG.
Spark Driver data collection
These functions get the data from executors into the driver.
● collect: the driver obtains all the information. This is a really dangerous
operation that could kill the driver if you handle huge amounts of data.
● take(n), takeOrdered(n), first: bring the first n results (or the first element) to the driver.
● count: counts the number of elements.
Actions
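A quick sketch of these actions on the numbersRDD from the earlier slides:

numbersRDD.count()          // 5
numbersRDD.first()          // 1
numbersRDD.take(3)          // Array(1, 2, 3)
numbersRDD.takeOrdered(2)   // Array(1, 2)
numbersRDD.collect()        // Array(1, 2, 3, 4, 5): harmless here, dangerous on real data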
Apache Spark: basic concepts
Data usage or movement
● foreach(f: T => Unit): applies the function to each element. It cannot
modify the values (use map for that).
Use this function for side effects instead of map.
rdd.foreach { v =>
doSomethingWithIt(v) // good behaviour
v + 1 // this won’t modify the value
}
rdd.map { v =>
doSomethingWithIt(v) // bad behaviour
v + 1 // this will modify the value
}
Actions
Apache Spark: basic concepts
Data usage or movement
● saveAs*
Saves the RDD to multiple sinks, such as Hadoop or the local file system.
Spark’s RDD API is not commonly used nowadays, and it should only be used
when you cannot use the Dataset API.
dataset.rdd().makeRDDThings.toDS() // pseudocode: drop down to the RDD API and back to a Dataset
Actions
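For completeness, a small sketch of one of the saveAs* actions; the output path is made up:

// Writes one text file part per partition under the given directory.
numbersRDD.saveAsTextFile("file:/tmp/numbers-out")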
Apache Spark: basic concepts
One of the most important features in Spark: it allows us to reuse intermediate
results instead of recomputing them (a sketch follows below).
● cache():
Persist this RDD with the default storage level (MEMORY_ONLY).
● persist(newLevel: StorageLevel)
Set this RDD's storage level to persist its values across operations after the
first time it is computed.
Persistence and caching
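A minimal sketch of the difference persistence makes; the file path and the parsing are assumptions:

import org.apache.spark.storage.StorageLevel

val parsed = spark.sparkContext
  .textFile("file:/tmp/app.log")         // hypothetical input
  .map(line => line.split("\\|"))        // hypothetical parsing
  .persist(StorageLevel.MEMORY_ONLY)     // equivalent to cache()

val total = parsed.count()               // computes and materializes the RDD
val sample = parsed.take(5)              // served from the cached partitions, no recomputation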
Apache Spark: basic concepts
Storage Levels:
● MEMORY_ONLY
● MEMORY_AND_DISK
● MEMORY_ONLY_SER
● MEMORY_AND_DISK_SER
● DISK_ONLY
● All of the above with _2
● OFF_HEAP
Persistence and caching
Apache Spark: basic concepts
1. Read the log file.
2. Print the total number of lines and characters, and show the first few lines.
3. Split each line into a tuple and persist it.
4. Calculate the number of logs per country. Show the top 3.
5. Calculate the number of logs per type.
6. Check when the errors start happening.
7. Do the same with Datasets (a partial solution sketch follows below).
Painful RDD example
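A partial solution sketch for the first steps, assuming pipe-separated lines of the form date|country|type|message (the format, path and column positions are assumptions):

val lines = spark.sparkContext.textFile("file:/tmp/app.log")
// 2. Totals and a peek at the data.
println(s"lines: ${lines.count()}, chars: ${lines.map(_.length.toLong).reduce(_ + _)}")
lines.take(5).foreach(println)
// 3. Split each line into a tuple and persist it for the next steps.
val logs = lines.map(_.split("\\|")).map(f => (f(0), f(1), f(2))).cache()
// 4. Logs per country, top 3.
logs.map(t => (t._2, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).take(3).foreach(println)
// 5. Logs per type.
logs.map(t => (t._3, 1)).reduceByKey(_ + _).collect().foreach(println)
// 6. Earliest timestamp carrying an error (assuming dates sort lexicographically).
val firstError = logs.filter(_._3 == "ERROR").map(_._1).min()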
Apache Spark: basic concepts
Complete the following exercise
Exercise
Apache Spark: basic concepts
Persistence and caching
(DAG sketch: read file → print lines / count lines / count chars; read file → obtain tuples → logs per country / logs per type / min of date. Without persisting the tuples, every action asks the file system where to get the data and reads it again.)
Apache Spark: basic concepts
Wordcount example
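The code of this slide is an image in the original deck; a typical RDD word count looks roughly like this (the input path is made up):

val counts = spark.sparkContext
  .textFile("file:/tmp/input.txt")   // hypothetical input file
  .flatMap(_.split("\\s+"))          // split every line into words
  .map(word => (word, 1))            // emit (word, 1) pairs
  .reduceByKey(_ + _)                // sum the counts of each word
counts.take(10).foreach(println)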
3. Spark SQL
Spark SQL
Cleaner API than the RDD:
● Schema inference: strong type safety at compile time!
● Business intelligence easier than ever!
● Throw some SQL if you are stuck!!
spark.read.text("./myfile.txt").createOrReplaceTempView("data")
val data = spark.sql("SELECT * FROM data")
data.show()
Datasets
Spark SQL
● read: reads the data from the selected datasource.
If the data is partitioned, Spark handles the partitions for you.
spark.read.options(MapWithOptions).csv / .parquet / .json / .text / .jdbc / .orc / [...]
Data usage or movement (Dataset)
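A concrete read sketch, with a made-up path and options:

val people = spark.read
  .options(Map("header" -> "true", "delimiter" -> "|"))
  .csv("file:/tmp/people.csv")   // hypothetical file
people.printSchema()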
Spark SQL
● write: saves the data into the selected datasource.
Each partition is saved separately, unless you coalesce or repartition first.
The output will create parts inside the designated folder, and they can be
read natively by Spark by reading the parent folder.
ds.write.options(
Map("header" -> "true", "delimiter" -> "|")
).csv("file:/tmp/test")
Data usage or movement (Dataset)
Spark SQL
● select(“col1”, “col2”:*): obtains the selected columns from the dataset.
● drop(“col1”, “col2”:*): drops the selected columns from the dataset.
● filter(expr): obtains the rows that satisfy the expression.
● union(ds): combines with another dataset with a similar schema.
● dropDuplicates(“col1”, “col2”:*): drops the duplicated rows considering the
columns given.
● except(ds): obtains the rows not present in the other dataset.
● join(ds, joinExprs, joinType): the good ol’ join.
● groupBy(“cols”): groups the dataset by the given columns. You can then
use aggregation functions with agg(f1, f2, f3, ...) (see the sketch below).
[...]
Dataset transformations
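Chaining a few of these on the personDS built in the Dataset slide (assuming spark.implicits._ is in scope for the $ syntax):

import org.apache.spark.sql.functions._

personDS
  .filter($"age" > 21)              // keep the adults
  .groupBy($"name")                 // group by the name column
  .agg(avg($"age").as("avg_age"))   // aggregate with agg(...)
  .show()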
Spark SQL
Complete the following exercise
Exercise
4. Spark deployment
● Driver: in charge of managing the job sent to the executors.
● Executors: in charge of running the tasks from the job.
● Cluster Manager: provides connectivity and the resources the driver needs.
○ Standalone
○ Mesos
○ Yarn
○ Kubernetes (k8s)
● SparkContext / SparkSession: must be created to run Spark jobs. Configures the
context where the jobs will run (number and location of executors, resources, …).
Only one active SparkContext can exist per JVM.
Spark deployment
Components
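A typical SparkSession creation, as a sketch; the application name and master are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-spark-job")   // placeholder name
  .master("local[*]")        // or yarn, mesos, k8s, spark://host:port
  .getOrCreate()             // reuses the session if one already exists in the JVM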
Spark deployment
Components
Spark deployment
Components
Application lifecycle
● The Spark job is packaged into a JAR, usually an “uber JAR” without Spark or Hadoop
dependencies (they are added at runtime).
● How to launch the job varies between Cluster Managers. Usually, you have an
endpoint where you can submit your jobs (a fuller example follows below).
./bin/spark-submit --master spark://host:port [...]
● The JAR is received by the Cluster Manager, which proceeds to launch a Driver
program containing the JAR.
● The driver starts the job and asks the Cluster Manager for the resources needed for
the executors.
● The driver connects to the executors, sends the JAR, and the executors start
receiving tasks until the job is finished.
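An illustrative spark-submit invocation with some common flags; the main class, JAR name and resource sizes are placeholders:

# Placeholders only: adapt the class, master, resources and JAR to your job.
./bin/spark-submit \
  --class com.example.MyJob \
  --master spark://host:port \
  --deploy-mode cluster \
  --executor-memory 2g \
  my-job-uber.jar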
Spark deployment
Components
Job components
● Task: a unit of work that will be sent to one executor.
● Job: a parallel computation consisting of multiple tasks that gets spawned in
response to a Spark action (e.g. save, collect); you'll see this term used in the driver's
logs.
● Stage: each job gets divided into smaller sets of tasks called stages that depend on
each other (similar to the map and reduce stages in MapReduce).
Books...
people@stratio.com
WE ARE HIRING
@StratioBD
Thanks!
hjacynycz@stratio.com