The document provides an overview of the different Spark APIs for working with structured data: RDDs, DataFrames, and Datasets. It discusses the timeline and key features of each API. Spark 1.0 was built on RDDs (resilient distributed datasets). DataFrames were added in Spark 1.3 and introduced schema support and SQL-like capabilities. Datasets, introduced in Spark 1.6, provide a type-safe interface but are still experimental. DataFrames are currently regarded as the most stable and flexible API due to built-in optimizations and support for dynamic languages.
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Making sense of RDDs, DataFrames, SparkSQL and Datasets APIs
Motivation
An overview of the different Spark APIs for working with structured data.
Timeline of Spark APIs
Spark 1.0 used the RDD API - a Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
Spark 1.3 introduced the DataFrames API - a distributed collection of data organized into named columns. The concept is well known from R and Python Pandas.
Spark 1.6 introduced an experimental Datasets API - an extension of the DataFrame API that provides a type-safe, object-oriented programming interface.
RDD
RDD - Resilient Distributed Dataset
Functional transformations on partitioned collections of opaque objects.
Define a case class representing the schema of our data.
Each field represents a column of the DB.
>
defined class Person
Create parallelized collection (RDD)
>
peopleRDD: org.apache.spark.rdd.RDD[Person] = ParallelCollectionRDD[5885] at parallelize at <console>:95
RDD of type Person
>
rdd: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[5886] at filter at <console>:98
NB: Returns Person objects
>
Person(Sven,38)
case class Person(name: String, age: Int)
val peopleRDD = sc.parallelize(Array(
  Person("Lars", 37),
  Person("Sven", 38),
  Person("Florian", 39),
  Person("Dieter", 37)
))
val rdd = peopleRDD
  .filter(_.age > 37)
rdd
  .collect
  .foreach(println(_))
RDDs, SQL, DataFrames and DataSets
Person(Florian,39)
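The RDD steps above can also be run outside a notebook. The following is a minimal sketch rather than the notebook's own code: it assumes a local master in place of the predefined `sc`, and the names `RddSketch` and `olderThan` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  case class Person(name: String, age: Int)

  // Assumption: a local SparkContext stands in for the notebook's predefined `sc`.
  lazy val sc: SparkContext = SparkContext.getOrCreate(
    new SparkConf().setMaster("local[*]").setAppName("rdd-sketch"))

  // Functional transformation: the predicate is an ordinary Scala closure
  // that Spark runs per element without looking inside it.
  def olderThan(minAge: Int): Array[Person] = {
    val peopleRDD = sc.parallelize(Seq(
      Person("Lars", 37), Person("Sven", 38),
      Person("Florian", 39), Person("Dieter", 37)))
    peopleRDD.filter(_.age > minAge).collect()
  }

  def main(args: Array[String]): Unit = {
    olderThan(37).foreach(println)
    sc.stop()
  }
}
```

The result is an array of `Person` objects, so fields are accessed as plain Scala members (`p.name`, `p.age`).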
DataFrames
Declarative transformations on a partitioned collection of tuples.
>
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int]
>
+-------+---+
| name|age|
+-------+---+
| Lars| 37|
| Sven| 38|
|Florian| 39|
| Dieter| 37|
+-------+---+
>
root
|-- name: string (nullable = true)
|-- age: integer (nullable = false)
Show only age column
>
+---+
|age|
+---+
| 37|
| 38|
| 39|
| 37|
+---+
NB: The result set consists of Row objects holding Strings and Ints
>
[Sven,38]
[Florian,39]
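The declarative filter shown for DataFrames can be written either as a SQL-style string or as a Column expression. The sketch below illustrates both forms under stated assumptions: it targets Spark 2.x or later, where a `SparkSession` replaces the `sqlContext` used in this notebook, and the names `DataFrameSketch`, `namesByString` and `namesByColumn` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DataFrameSketch {
  case class Person(name: String, age: Int)

  // Assumption: Spark 2.x+ with a local session instead of the notebook's sqlContext.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]").appName("dataframe-sketch").getOrCreate()

  def people(): DataFrame = {
    import spark.implicits._
    Seq(Person("Lars", 37), Person("Sven", 38),
      Person("Florian", 39), Person("Dieter", 37)).toDF()
  }

  // Predicate given as a SQL-style string, as in the notebook.
  def namesByString(minAge: Int): Seq[String] =
    people().filter(s"age > $minAge")
      .collect().map(_.getAs[String]("name")).toSeq

  // The same predicate built as a Column expression.
  def namesByColumn(minAge: Int): Seq[String] =
    people().filter(col("age") > minAge)
      .collect().map(_.getAs[String]("name")).toSeq

  def main(args: Array[String]): Unit = {
    println(namesByString(37))
    println(namesByColumn(37))
    spark.stop()
  }
}
```

In both cases the collected values are generic `Row` objects, so fields have to be looked up by name or position and cast to the expected type.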
DataSets
Create DataFrames from RDDs
Implicit conversion is also available
>
peopleDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
NB: The result set consists of Person objects
val peopleDF = peopleRDD.toDF
peopleDF.show()
peopleDF.printSchema()
peopleDF.select("age").show()
peopleDF
  .filter("age > 37")
  .collect
  .foreach(row => println(row))
val peopleDS = peopleRDD.toDS
>
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
>
Person(Sven,38)
Person(Florian,39)
>
res294: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97
>
res295: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97
>
Person(Sven,38)
Person(Florian,39)
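The practical difference to DataFrames is that Dataset operations take typed lambdas and return typed results. The sketch below is an illustration under assumptions, not the notebook's code: it uses the Spark 2.x+ `SparkSession` entry point, and `DatasetSketch`, `olderThan` and `names` are hypothetical names.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetSketch {
  case class Person(name: String, age: Int)

  // Assumption: Spark 2.x+ with a local session instead of the notebook's sqlContext.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]").appName("dataset-sketch").getOrCreate()

  def people(): Dataset[Person] = {
    import spark.implicits._
    Seq(Person("Lars", 37), Person("Sven", 38),
      Person("Florian", 39), Person("Dieter", 37)).toDS()
  }

  // The lambda receives a Person; a misspelled field would fail at compile time.
  def olderThan(minAge: Int): Array[Person] =
    people().filter(_.age > minAge).collect()

  // A typed map changes the element type; the implicits supply the String encoder.
  def names(): Array[String] = {
    import spark.implicits._
    people().map(_.name).collect()
  }

  def main(args: Array[String]): Unit = {
    olderThan(37).foreach(println)
    names().foreach(println)
    spark.stop()
  }
}
```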
Spark SQL
>
import org.apache.spark.sql.SQLContext
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@50ec5129
Register DataFrame for usage via SQL
>
The results of SQL queries are DataFrames and support all the usual RDD operations.
>
res298: org.apache.spark.sql.DataFrame = [name: string, age: int]
Print execution plan
>
res299: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
+- Subquery sparkpeopletbl
+- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97
Print the execution plan optimized by the built-in optimizer
val ds = peopleDS
  .filter(_.age > 37)
ds.collect
  .foreach(println(_))
ds.queryExecution.analyzed
ds.queryExecution.optimizedPlan
ds.collect
  .foreach(println(_))
// Get SQL context from Spark context
// NB: "In Databricks, developers should utilize the shared HiveContext instead of creating one using the constructor.
// In Scala and Python notebooks, the shared context can be accessed as sqlContext. When running a job, you can access the
// shared context by calling SQLContext.getOrCreate(SparkContext.getOrCreate())."
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
peopleDF.registerTempTable("sparkPeopleTbl")
sqlContext.sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.analyzed
>
res300: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
+- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97
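The analyzed and optimized plans printed above can be captured programmatically through `queryExecution`. The following is a rough sketch under assumptions: it uses the Spark 2.x+ API (`SparkSession`, and `createOrReplaceTempView` in place of `registerTempTable`), and `PlanSketch` and `plans` are hypothetical names. The exact plan text differs between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object PlanSketch {
  case class Person(name: String, age: Int)

  // Assumption: Spark 2.x+ with a local session instead of the notebook's sqlContext.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]").appName("plan-sketch").getOrCreate()

  // Returns the (analyzed, optimized) logical plans rendered as text.
  def plans(minAge: Int): (String, String) = {
    import spark.implicits._
    // Build the DataFrame from an RDD, as the notebook does.
    val peopleDF = spark.sparkContext.parallelize(Seq(
      Person("Lars", 37), Person("Sven", 38),
      Person("Florian", 39), Person("Dieter", 37))).toDF()
    peopleDF.createOrReplaceTempView("sparkPeopleTbl")
    val query = spark.sql(s"SELECT * FROM sparkPeopleTbl WHERE age > $minAge")
    (query.queryExecution.analyzed.toString,
      query.queryExecution.optimizedPlan.toString)
  }

  def main(args: Array[String]): Unit = {
    val (analyzed, optimized) = plans(37)
    println(analyzed)
    println(optimized)
    spark.stop()
  }
}
```

Calling `explain(true)` on the query prints the same logical plans together with the physical plan.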
>
Sven 38
Florian 39
>
res301: Array[org.apache.spark.sql.Row] = Array([Sven,38], [Florian,39])
>
res302: Array[String] = Array(NAME: Sven, NAME: Florian)
>
res303: Array[String] = Array(NAME: Sven, NAME: Florian)
>
res304: org.apache.spark.sql.DataFrame = [name: string, age: int]
Running SQL queries against Parquet files directly
>
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_SUCCESS [0.00 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_common_metadata [0.00 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_metadata [0.03 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00000-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00001-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00002-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.84 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00003-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00004-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00005-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00006-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00007-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00008-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00009-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.95 MiB]
>
0E0157EB3E6F927FE7FA86C3A0762B3B 4E0569072023608FFEB72F454CCF408B VTS 1 2013-01-13T11:08:00.000+0000
19BF1BB516C4E992EA3FBAEDA73D6262 E4CAC9101BFE631554B57906364761D3 VTS 2 2013-01-13T10:33:00.000+0000
D57C7392455C38D9404660F7BC63D1B6 EC5837D805127379D72FF6C35279890B VTS 1 2013-01-13T04:58:00.000+0000
67108EDF8123623806A1DAFE8811EE63 36C9437BD2FF31940BEBED44DDDDDB8A VTS 5 2013-01-
name age
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.optimizedPlan
%sql SELECT * FROM sparkPeopleTbl WHERE age > 37
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").collect
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").map(row => "NAME: " + row(0)).collect
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").rdd.map(row => "NAME: " + row(0)).collect()
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")
ls3("s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/")
%sql SELECT * FROM parquet.`s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/` WHERE trip_time_in_secs <= 60
Showing the first 1000 rows.
13T04:58:00.000+0000
5F78CC6D4ECD0541B765FECE17075B6F 39703F5449DADC0AEFFAFEFB1A6A7E64 VTS 1 2013-01-13T08:56:00.000+0000
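Querying Parquet files in place works without registering a table first: the table name is replaced by the format name and a back-quoted path. The sketch below reproduces the idea on a small local file instead of the S3 taxi data. It assumes Spark 2.x or later, and the temporary directory, `ParquetSqlSketch` and `namesFromParquet` are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ParquetSqlSketch {
  case class Person(name: String, age: Int)

  // Assumption: Spark 2.x+ with a local session instead of the notebook's sqlContext.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]").appName("parquet-sql-sketch").getOrCreate()

  def namesFromParquet(minAge: Int): Seq[String] = {
    import spark.implicits._
    // Hypothetical scratch location; the notebook reads from S3 instead.
    val path = Files.createTempDirectory("people-parquet")
      .resolve("people").toString
    Seq(Person("Lars", 37), Person("Sven", 38),
      Person("Florian", 39), Person("Dieter", 37)).toDF().write.parquet(path)
    // No temp table is registered: the FROM clause names the files directly.
    spark.sql(
      s"SELECT name FROM parquet.`$path` WHERE age > $minAge ORDER BY age")
      .collect().map(_.getString(0)).toSeq
  }

  def main(args: Array[String]): Unit = {
    println(namesFromParquet(37))
    spark.stop()
  }
}
```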
Conclusions
RDDs
RDDs remain the core component for native distributed collections. However, because they lack built-in optimization, DataFrames and Datasets should be preferred.
DataFrames
DataFrames and Spark SQL are very flexible and bring built-in optimization to dynamic languages like Python and R as well. In addition, they make it possible to combine declarative and functional ways of working with structured data. DataFrames are regarded as the most stable and flexible API.
Datasets
Datasets unify the best of both worlds: the type safety of RDDs and the built-in optimization available for DataFrames. However, the API is still experimental and, as of Spark 1.6, can be used from Scala and Java only. Datasets allow even further optimizations (memory compaction and faster serialization using encoders). Datasets are not yet mature, but they are expected to quickly become a common way to work with structured data, and Spark plans to converge the APIs even further.
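The encoders mentioned above describe how a JVM type maps onto Spark's internal row format. As a small illustration, the sketch below derives an encoder for the `Person` case class and inspects the schema it carries. The object name `EncoderSketch` is hypothetical, and the code only demonstrates the mapping, not the performance benefit.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

object EncoderSketch {
  case class Person(name: String, age: Int)

  // Derive an encoder from the case class; no running Spark session is needed.
  val personEncoder: Encoder[Person] = Encoders.product[Person]

  def main(args: Array[String]): Unit = {
    // The schema lists one column per case class field, with Spark SQL types.
    println(personEncoder.schema.treeString)
  }
}
```

Because the encoder knows the field types up front, Spark can store and serialize `Person` values in its own format instead of relying on generic Java serialization.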
Here are some benchmarks of Datasets:
Transformation of Data Types in Spark