4 / 30
DataFrame
A distributed collection of rows organized into named columns
An abstraction for selecting, filtering, aggregating and plotting structured data
8 / 30
DataFrame
Write Less Code: Powerful Operations
Common operations can be expressed concisely as calls to the DataFrame API; a short sketch follows this list:
• Selecting required columns
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Filtering
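A minimal Scala sketch of the four operations above, assuming two existing DataFrames df (name, age, deptId) and depts (deptId, deptName); these names are illustrative, not from the slides:
// Illustrative inputs: df(name, age, deptId) and depts(deptId, deptName).
val selected   = df.select(df("name"), df("age"))                  // select required columns
val joined     = df.join(depts, df("deptId") === depts("deptId"))  // join different data sources
val aggregated = df.groupBy(df("deptId")).count()                  // aggregate
val filtered   = df.filter(df("age") > 21)                         // filter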
9 / 30
Creating DataFrames
With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// age name
// null Michael
// 30 Andy
// 19 Justin
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
10 / 30
Creating DataFrames (continued)
[Diagram: sqlContext.read returns a DataFrameReader; calling .json(path: String) on it produces a DataFrame, shown as the table Name/Age: Michael 29, Andy 30, Justin 19.]
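The diagram shows sqlContext.read returning a DataFrameReader. The same reader exposes other built-in formats; a small hedged sketch (Spark 1.4-era API, file paths taken from the Spark examples tree and otherwise illustrative):
// Sketch: other DataFrameReader entry points.
val parquetDF = sqlContext.read.parquet("examples/src/main/resources/users.parquet")
val genericDF = sqlContext.read.format("json").load("examples/src/main/resources/people.json")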
11 / 30
Creating DataFrames (continued)
[Diagram: the resulting DataFrame as a table. Name/Age: Michael 29, Andy 30, Justin 19.]
12 / 30
DataFrame Operations
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name (age + 1)
// Michael null
// Andy 31
// Justin 20
// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30 Andy
// Count people by age
df.groupBy("age").count().show()
// age count
// null 1
// 19 1
// 30 1
[Diagram: df("column name") returns a Column object. DataFrame operations such as select, filter, groupBy, and join transform the input table (Name/Age: Michael 29, Andy 30, Justin 19) into the outputs shown above.]
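Because df("age") returns a Column, comparisons and arithmetic build expressions that Spark evaluates lazily when an action such as show() runs; a short sketch (variable names are mine, the Column API calls are standard):
val ageCol  = df("age")    // org.apache.spark.sql.Column
val isAdult = ageCol > 21  // still a Column: a boolean expression, not a value
df.filter(isAdult).show()
df.select(df("name"), (ageCol + 1).as("agePlusOne")).show()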
13 / 30
DataFrame Operations
val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")
The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame; the table named in the query must first be registered (e.g. via registerTempTable, shown on later slides).
[Diagram: SQLContext.sql takes a query string as input and returns a DataFrame.]
14 / 30
DataFrame
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
… (continued on the next slide)
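The note above points to the Product interface as the workaround for the Scala 2.10 22-field limit. A minimal sketch of what such a class might look like (three fields for brevity; the class and field names are mine, and this is illustrative rather than a drop-in recipe):
// Illustrative Product implementation standing in for a >22-field class.
class WideRecord(val name: String, val age: Int, val city: String)
    extends Product with Serializable {
  def productArity: Int = 3
  def productElement(n: Int): Any = n match {
    case 0 => name
    case 1 => age
    case 2 => city
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
}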
15 / 30
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)
DataFrame
Inferring the Schema Using Reflection
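As a quick check, not shown on the slides, printSchema reveals what reflection inferred from the case class; for Person(name: String, age: Int) the expected output is sketched in the comments:
people.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)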
16 / 30
DataFrame
Inferring the Schema Using Reflection (continued)
[Diagram: contents of the input file people.txt: Michael, 29 / Andy, 30 / Justin, 19.]
17 / 30
DataFrame
Inferring the Schema Using Reflection (continued)
[Diagram: the pipeline from text file to DataFrame. people.txt lines ("Michael, 29" / "Andy, 30" / "Justin, 19") → .map(_.split(",")) → RDD of arrays (Array(Michael, 29) / Array(Andy, 30) / Array(Justin, 19)) → .map(p => Person(p(0), p(1).trim.toInt)) → RDD of Person objects (name/age: Michael 29, Andy 30, Justin 19) → .toDF() → DataFrame with columns Name/Age: Michael 29, Andy 30, Justin 19.]
18 / 30
DataFrame
Inferring the Schema Using Reflection (continued)
[Diagram: the resulting DataFrame. Name/Age: Michael 29, Andy 30, Justin 19.]
19 / 30
DataFrame
Programmatically Specifying the Schema
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Import Row.
import org.apache.spark.sql.Row
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Generate the schema based on the schema string
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))
[Diagram: the people RDD: Michael, 29 / Andy, 30 / Justin, 19.]
20 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: schemaString.split(" ") produces the field names (name, age).]
21 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: .map(fieldName => StructField(fieldName, StringType, true)) turns each field name from schemaString.split(" ") into a StructField: StructField:"name", StructField:"age".]
22 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: StructType(...) wraps the sequence (StructField:"name", StructField:"age") produced by the map into the final schema.]
23 / 30
DataFrame
Programmatically Specifying the Schema
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
// Register the DataFrame as a table.
peopleDataFrame.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")
// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index or by field name.
results.map(t => "Name: " + t(0)).collect().foreach(println)
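Because schemaString maps every field to StringType, the age values above stay strings. A hedged variant, not from the slides, that declares age as an integer instead (the Row must then carry an Int in that position):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val typedSchema = StructType(Seq(
  StructField("name", StringType,  nullable = true),
  StructField("age",  IntegerType, nullable = true)))
val typedRowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))  // Int matches IntegerType
val typedDF = sqlContext.createDataFrame(typedRowRDD, typedSchema)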
24 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: the people RDD (Michael, 29 / Andy, 30 / Justin, 19) is mapped into the row RDD: Row(Michael, 29) / Row(Andy, 30) / Row(Justin, 19).]
25 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: createDataFrame applies the schema (StructField:"name", StructField:"age") to the row RDD (Row(Michael, 29) / Row(Andy, 30) / Row(Justin, 19)), producing a DataFrame with columns name/age: Michael 29, Andy 30, Justin 19.]
26 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: the results DataFrame returned by "SELECT name FROM people", with its single column name: Michael, Andy, Justin.]
27 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: t(0) indexes the first field of each Row (t(1) would index the second); results from "SELECT name FROM people" has only the name column, so t(0) accesses the name.]
28 / 30
DataFrame
Programmatically Specifying the Schema (continued)
[Diagram: collect() gathers the mapped values back to the driver. Each t(0) is the name field of a Row, so results.map(t => "Name: " + t(0)).collect() yields Array("Name: Michael", "Name: Andy", "Name: Justin"), which foreach(println) prints line by line.]
32 / 30
Reference
Michael Armbrust et al. (2015), "Spark SQL: Relational Data Processing in Spark", SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383-1394
Spark Site, "Spark SQL and DataFrame Guide", http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
YouTube, "Spark DataFrames: Simple and Fast Analysis of Structured Data - Michael Armbrust (Databricks)", https://www.youtube.com/watch?v=xWkJCUcD55w
Blog, "Spark SQL Internals", http://www.trongkhoanguyen.com/2015/08/sparksql-internals.html
Databricks, "Deep Dive into Spark SQL's Catalyst Optimizer", https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Spark Site, "Spark API Documentation: Scala", http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package