4 / 30
DataFrame
A distributed collection of rows organized into named columns
An abstraction for selecting, filtering, aggregating and plotting structured data
8 / 30
DataFrame
Write Less Code: Powerful Operations
Common operations can be expressed concisely as calls to the DataFrame API; a short sketch follows this list:
• Selecting required columns
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Filtering
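A minimal Scala sketch of the four operations above, assuming two existing DataFrames df (name, age, deptId) and depts (deptId, deptName); these names are illustrative, not from the slides:
// Illustrative inputs: df(name, age, deptId) and depts(deptId, deptName).
val selected   = df.select(df("name"), df("age"))                  // select required columns
val joined     = df.join(depts, df("deptId") === depts("deptId"))  // join different data sources
val aggregated = df.groupBy(df("deptId")).count()                  // aggregate
val filtered   = df.filter(df("age") > 21)                         // filter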
9 / 30
Creating DataFrames
With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// age name
// null Michael
// 30 Andy
// 19 Justin
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
10 / 30
Creating DataFrames (continued)
[Diagram: sqlContext.read returns a DataFrameReader; calling .json(path: String) on it produces a DataFrame, shown as the table Name/Age: Michael 29, Andy 30, Justin 19.]
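The diagram shows sqlContext.read returning a DataFrameReader. The same reader exposes other built-in formats; a small hedged sketch (Spark 1.4-era API, file paths taken from the Spark examples tree and otherwise illustrative):
// Sketch: other DataFrameReader entry points.
val parquetDF = sqlContext.read.parquet("examples/src/main/resources/users.parquet")
val genericDF = sqlContext.read.format("json").load("examples/src/main/resources/people.json")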
11 / 30
Creating DataFrames (continued)
[Diagram: the resulting DataFrame as a table. Name/Age: Michael 29, Andy 30, Justin 19.]
12 / 30
DataFrame Operations
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name (age + 1)
// Michael null
// Andy 31
// Justin 20
// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30 Andy
// Count people by age
df.groupBy("age").count().show()
// age count
// null 1
// 19 1
// 30 1
[Diagram: df("column name") returns a Column object. DataFrame operations such as select, filter, groupBy, and join transform the input table (Name/Age: Michael 29, Andy 30, Justin 19) into the outputs shown above.]
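Because df("age") returns a Column, comparisons and arithmetic build expressions that Spark evaluates lazily when an action such as show() runs; a short sketch (variable names are mine, the Column API calls are standard):
val ageCol  = df("age")    // org.apache.spark.sql.Column
val isAdult = ageCol > 21  // still a Column: a boolean expression, not a value
df.filter(isAdult).show()
df.select(df("name"), (ageCol + 1).as("agePlusOne")).show()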
13 / 30
DataFrame Operations
val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")
The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame; the table named in the query must first be registered (e.g. via registerTempTable, shown on later slides).
[Diagram: SQLContext.sql takes a query string as input and returns a DataFrame.]
14 / 30
DataFrame
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
… (continued on the next slide)
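The note above points to the Product interface as the workaround for the Scala 2.10 22-field limit. A minimal sketch of what such a class might look like (three fields for brevity; the class and field names are mine, and this is illustrative rather than a drop-in recipe):
// Illustrative Product implementation standing in for a >22-field class.
class WideRecord(val name: String, val age: Int, val city: String)
    extends Product with Serializable {
  def productArity: Int = 3
  def productElement(n: Int): Any = n match {
    case 0 => name
    case 1 => age
    case 2 => city
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
}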
15 / 30
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)
DataFrame
Inferring the Schema Using Reflection
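As a quick check, not shown on the slides, printSchema reveals what reflection inferred from the case class; for Person(name: String, age: Int) the expected output is sketched in the comments:
people.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)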
16 / 30
DataFrame
Inferring the Schema Using Reflection (continued)
[Diagram: contents of the input file people.txt: Michael, 29 / Andy, 30 / Justin, 19.]
17 / 30
DataFrame
Inferring the Schema Using Reflection (continued)
[Diagram: the pipeline from text file to DataFrame. people.txt lines ("Michael, 29" / "Andy, 30" / "Justin, 19") → .map(_.split(",")) → RDD of arrays (Array(Michael, 29) / Array(Andy, 30) / Array(Justin, 19)) → .map(p => Person(p(0), p(1).trim.toInt)) → RDD of Person objects (name/age: Michael 29, Andy 30, Justin 19) → .toDF() → DataFrame with columns Name/Age: Michael 29, Andy 30, Justin 19.]
18 / 30
DataFrame
Inferring the Schema Using Reflection (continued)
[Diagram: the resulting DataFrame. Name/Age: Michael 29, Andy 30, Justin 19.]
19 / 30
DataFrame
Programmatically Specifying the Schema
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Import Row.
import org.apache.spark.sql.Row
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Generate the schema based on the schema string
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))
[Diagram: the people RDD: Michael, 29 / Andy, 30 / Justin, 19.]
20 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: schemaString.split(" ") produces the field names (name, age).]
21 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: .map(fieldName => StructField(fieldName, StringType, true)) turns each field name from schemaString.split(" ") into a StructField: StructField:"name", StructField:"age".]
22 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: StructType(...) wraps the sequence (StructField:"name", StructField:"age") produced by the map into the final schema.]
23 / 30
DataFrame
Programmatically Specifying the Schema
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
// Register the DataFrame as a table.
peopleDataFrame.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")
// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index or by field name.
results.map(t => "Name: " + t(0)).collect().foreach(println)
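Because schemaString maps every field to StringType, the age values above stay strings. A hedged variant, not from the slides, that declares age as an integer instead (the Row must then carry an Int in that position):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val typedSchema = StructType(Seq(
  StructField("name", StringType,  nullable = true),
  StructField("age",  IntegerType, nullable = true)))
val typedRowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))  // Int matches IntegerType
val typedDF = sqlContext.createDataFrame(typedRowRDD, typedSchema)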
24 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: the people RDD (Michael, 29 / Andy, 30 / Justin, 19) is mapped into the row RDD: Row(Michael, 29) / Row(Andy, 30) / Row(Justin, 19).]
25 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: createDataFrame applies the schema (StructField:"name", StructField:"age") to the row RDD (Row(Michael, 29) / Row(Andy, 30) / Row(Justin, 19)), producing a DataFrame with columns name/age: Michael 29, Andy 30, Justin 19.]
26 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: the results DataFrame returned by "SELECT name FROM people", with its single column name: Michael, Andy, Justin.]
27 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: t(0) indexes the first field of each Row (t(1) would index the second); results from "SELECT name FROM people" has only the name column, so t(0) accesses the name.]
28 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: collect() gathers the mapped values back to the driver. Each t(0) is the name field of a Row, so results.map(t => "Name: " + t(0)).collect() yields Array("Name: Michael", "Name: Andy", "Name: Justin"), which foreach(println) prints line by line.]
32 / 30
Reference
Michael Armbrust et al. (2015), "Spark SQL: Relational Data Processing in Spark", SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383-1394
Spark Site, "Spark SQL and DataFrame Guide", http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
YouTube, "Spark DataFrames: Simple and Fast Analysis of Structured Data - Michael Armbrust (Databricks)", https://www.youtube.com/watch?v=xWkJCUcD55w
Blog, "Spark SQL Internals", http://www.trongkhoanguyen.com/2015/08/sparksql-internals.html
Databricks, "Deep Dive into Spark SQL's Catalyst Optimizer", https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Spark Site, "Spark API Documentation: Scala", http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package