SPARK SQL
By
Eng. Joud Khattab
Content
1. Introduction.
2. Core concepts:
– RDD, Dataset and DataFrame, Hive Database.
3. Spark SQL The whole story.
4. How does it all work?
5. Spark in R:
– Sparklyr Library.
6. Example.
7. References.
Spark SQL
■ Spark SQL is a Spark module for structured data processing.
■ Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD.
Spark SQL
■ Spark SQL was first released in Spark 1.0 (May, 2014).
■ Initially committed by Michael Armbrust & Reynold Xin from Databricks.
■ Spark introduces a programming module for structured data processing called
Spark SQL.
■ It provides a programming abstraction called DataFrame and can act as a distributed SQL query engine.
Challenges and Solutions
Challenges
■ Perform ETL to and from various
(semi- or unstructured) data sources.
■ Perform advanced analytics (e.g.
machine learning, graph processing)
that are hard to express in relational
systems.
Solutions
■ A DataFrame API that can perform
relational operations on both external
data sources and Spark’s built-in RDDs.
■ A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.
Spark SQL Architecture
Spark SQL Architecture
■ Language API:
– Spark SQL is compatible with the different languages that Spark supports.
– It is exposed through the Python, Scala, Java, and HiveQL APIs.
■ Schema RDD:
– Spark Core is designed with special data structure called RDD.
– Generally, Spark SQL works on schemas, tables, and records.
– Therefore, we can use the Schema RDD as temporary table.
– We can call this Schema RDD as Data Frame.
■ Data Sources:
– Usually the data source for Spark Core is a text file, Avro file, etc. However, the data sources for Spark SQL are different.
– They include Parquet files, JSON documents, Hive tables, and Cassandra databases.
Features of Spark SQL
1. Integrated:
– Seamlessly mix SQL queries with Spark programs.
– Spark SQL lets you query structured data as a distributed dataset (RDD) in
Spark, with integrated APIs in Python, Scala and Java.
– This tight integration makes it easy to run SQL queries alongside complex
analytic algorithms.
2. Unified Data Access:
– Load and query data from a variety of sources.
– Schema-RDDs provide a single interface for efficiently working with structured
data, including Apache Hive tables, parquet files and JSON files.
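As a minimal sketch of this integration (not from the slides; it assumes a local PySpark installation and uses the newer SparkSession entry point rather than the older sqlContext, with a hypothetical people.json input file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-intro").getOrCreate()

# Load structured data once and expose it to both APIs.
people = spark.read.json("people.json")      # hypothetical input file
people.createOrReplaceTempView("people")

# The same question asked through SQL and through the DataFrame API.
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
people.filter(people.age > 30).select("name", "age").show()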
Features of Spark SQL
3. Hive Compatibility:
– Run unmodified Hive queries on existing warehouses.
– Spark SQL reuses the Hive frontend and MetaStore, giving you full compatibility
with existing Hive data, queries, and UDFs.
– Simply install it alongside Hive.
SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
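For context, a minimal PySpark sketch of running the query above against an existing Hive deployment (assumptions: a Hive-enabled Spark build, a hive-site.xml pointing at the metastore, and an already registered hive_udf):

from pyspark.sql import SparkSession

# Hive support lets Spark SQL see the existing metastore, tables, and UDFs.
spark = (SparkSession.builder
    .appName("hive-compat")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)").show()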
Features of Spark SQL
4. Standard Connectivity:
– Connect through JDBC or ODBC.
– Spark SQL includes a server mode with industry standard JDBC and ODBC
connectivity.
Features of Spark SQL
5. Scalability:
– Use the same engine for both interactive and long queries.
– Spark SQL takes advantage of the RDD model to support mid-query fault
tolerance, letting it scale to large jobs too.
– Do not worry about using a different engine for historical data.
SPARK RDD
Resilient Distributed Datasets
SPARK RDD
(Resilient Distributed Datasets)
■ RDD is a fundamental data structure of Spark.
■ It is an immutable distributed collection of objects that can be stored in memory or
disk across a cluster.
■ Each dataset in RDD is divided into logical partitions, which may be computed on
different nodes of the cluster.
■ Parallel functional transformations (map, filter, …).
■ Automatically rebuilt on failure.
■ RDDs can contain any type of Python, Java, or Scala objects, including user-defined
classes.
SPARK RDD
(Resilient Distributed Datasets)
■ Formally, an RDD is a read-only, partitioned collection of records.
■ RDDs can be created through deterministic operations on either data on stable
storage or other RDDs.
■ RDD is a fault-tolerant collection of elements that can be operated on in parallel.
■ There are two ways to create RDDs:
– parallelizing an existing collection in your driver program.
– referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop Input Format.
■ Spark uses the RDD concept to achieve faster and more efficient MapReduce-style operations; a short sketch of both creation paths is shown below.
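A minimal PySpark sketch of both creation paths (the HDFS path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# 1. Parallelize an existing collection in the driver program.
squares = sc.parallelize(range(10)).map(lambda x: x * x)

# 2. Reference a dataset in an external storage system (hypothetical HDFS path).
lines = sc.textFile("hdfs:///data/events.txt")

print(squares.collect())
print(lines.filter(lambda l: "error" in l).count())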
SPARK SQL
DATASET AND DATAFRAME
Dataset and DataFrame
■ A distributed collection of data, which is organized into named columns.
■ Conceptually, it is equivalent to relational tables with good optimization techniques.
■ A DataFrame can be constructed from an array of different sources such as Hive
tables, Structured Data files, external databases, or existing RDDs.
■ This API was designed for modern Big Data and data science applications taking
inspiration from DataFrame in R Programming and Pandas in Python.
Dataset and DataFrame
■ DataFrame
– Data is organized into named columns, like a table in a relational database
■ Dataset: a distributed collection of data
– A new interface added in Spark 1.6
– Static-typing and runtime type-safety
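A small illustrative sketch (not from the slides) of building a DataFrame in PySpark from a local collection and from an existing RDD:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# From a local collection of Rows...
people = spark.createDataFrame([Row(name="Ann", age=34), Row(name="Bob", age=29)])

# ...or from an existing RDD of Rows.
rdd = spark.sparkContext.parallelize([Row(name="Cleo", age=41)])
more_people = spark.createDataFrame(rdd)

people.union(more_people).printSchema()   # named, typed columns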
Features of DataFrame
■ Ability to process data ranging in size from kilobytes to petabytes, on anything from a single-node cluster to a large cluster.
■ Supports different data formats (Avro, CSV, Elasticsearch, Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.).
■ State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer (a tree transformation framework).
■ Can be easily integrated with all Big Data tools and frameworks via Spark-Core.
■ Provides API for Python, Java, Scala, and R Programming.
SPARK SQL & HIVE
Hive Compatibility
■ Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with
existing Hive data, queries, and UDFs.
Hive
■ A database/data warehouse on top of Hadoop
– Rich data types
– Efficient implementations of SQL on top of MapReduce
■ Supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems
– Such as Amazon S3 filesystem.
■ Provides an SQL-like language called HiveQL with schema.
Hive Architecture
■ User issues a SQL query
■ Hive parses and plans the query
■ The query is converted to MapReduce jobs
■ The MapReduce jobs are run by Hadoop
User-Defined Functions
■ UDF: Plug in your own processing code and invoke it from a Hive query
– UDF (Plain UDF)
■ Input: single row, Output: single row
– UDAF (User-Defined Aggregate Function)
■ Input: multiple rows, Output: single row
■ e.g. COUNT and MAX
– UDTF (User-Defined Table-generating Function)
■ Input: single row, Output: multiple rows
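Spark SQL supports the same idea for plain one-row-in/one-row-out functions; a small PySpark sketch with an illustrative UDF name (not from the slides):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Plain UDF: single row in, single row out, callable from SQL once registered.
spark.udf.register("shout", lambda s: s.upper() + "!", StringType())

spark.createDataFrame([("hello",), ("spark",)], ["word"]).createOrReplaceTempView("words")
spark.sql("SELECT shout(word) AS loud FROM words").show()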
SPARK SQL
THE WHOLE STORY
The not-so-secret truth…
SQL
is not about SQL.
It is about declarative programming.
Spark SQL The whole story
■ Create and Run Spark Programs Faster:
1. Write less code.
2. Read less data.
3. Let the optimizer do the hard work.
■ RDD vs. DataFrame.
Write Less Code:
Input & Output
■ Unified interface to reading/writing data in a variety of formats:
df = (sqlContext.read
    .format("json")
    .option("samplingRatio", "0.1")
    .load("/home/michael/data.json"))

(df.write
    .format("parquet")
    .mode("append")
    .partitionBy("year")
    .saveAsTable("fasterData"))
read and write
functions create
new builders for
doing I/O
Write Less Code:
Input & Output
■ Unified interface to reading/writing data in a variety of formats:
df = (sqlContext.read
    .format("json")
    .option("samplingRatio", "0.1")
    .load("/home/michael/data.json"))

(df.write
    .format("parquet")
    .mode("append")
    .partitionBy("year")
    .saveAsTable("fasterData"))
Builder methods
are used to specify:
• Format
• Partitioning
• Handling of
existing data
• and more
Write Less Code:
Input & Output
■ Unified interface to reading/writing data in a variety of formats:
df = (sqlContext.read
    .format("json")
    .option("samplingRatio", "0.1")
    .load("/home/michael/data.json"))

(df.write
    .format("parquet")
    .mode("append")
    .partitionBy("year")
    .saveAsTable("fasterData"))
load(…), save(…) or saveAsTable(…) functions execute the I/O
Read Less Data:
Efficient Formats
■ Parquet is an efficient columnar storage format:
– Compact binary encoding with intelligent compression (delta, RLE, etc).
– Each column stored separately with an index that allows skipping of unread
columns.
– Data skipping using statistics (column min/max, etc).
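A rough PySpark sketch of what this buys in practice (path and column names are made up): write partitioned Parquet, then read back only what is needed.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.range(1000000).select(
    F.col("id").alias("user_id"),
    (F.col("id") % 5 + 2010).alias("year"),
)

# Columnar, partitioned storage on disk.
events.write.mode("overwrite").partitionBy("year").parquet("/tmp/events_parquet")

# Only the year=2013 partition and the user_id column have to be read back.
(spark.read.parquet("/tmp/events_parquet")
    .where(F.col("year") == 2013)
    .select("user_id")
    .explain())   # the plan shows partition filters and column pruning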
Write Less Code:
Powerful Operations
■ Common operations can be expressed concisely as calls to the DataFrame API:
– Selecting required columns.
– Joining different data sources.
– Aggregation (count, sum, average, etc).
– Filtering.
– Plotting results.
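A compact, self-contained sketch of such a chain in PySpark (the users/events tables and their columns are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in tables with a made-up schema.
users = spark.createDataFrame(
    [(1, "Amsterdam", "NL"), (2, "Berlin", "DE")],
    ["user_id", "city", "country"])
events = spark.createDataFrame(
    [(1, 30.0), (1, 45.0), (2, 10.0)],
    ["user_id", "duration"])

stats = (events.join(users, "user_id")                    # join different sources
    .where(F.col("country") == "NL")                      # filtering
    .groupBy("city")                                      # aggregation
    .agg(F.count("*").alias("n_events"),
         F.avg("duration").alias("avg_duration"))
    .orderBy(F.desc("n_events")))                         # ordering
stats.show()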
Write Less Code:
Compute an Average
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();
protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

private IntWritable one = new IntWritable(1);
private DoubleWritable average = new DoubleWritable();
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Write Less Code:
Compute an Average
Using RDDs
data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames
sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .collect()

Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
Not Just Less Code:
Faster Implementations
34
0 2 4 6 8 10
DataFrameSQL
DataFramePython
DataFrameScala
RDDPython
RDDScala
Time to Aggregate 10 million int pairs (secs)
By Joud Khattab
Plan Optimization & Execution
[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning generates candidate Physical Plans, a Cost Model selects one, and Code Generation turns the Selected Physical Plan into RDDs.]
DataFrames and SQL share the same optimization/execution pipeline.
def add_demographics(events):
    u = sqlCtx.table("users")                       # Load Hive table
    return (events
        .join(u, events.user_id == u.user_id)       # Join on user_id
        .withColumn("city", zipToCity(events.zip))) # udf adds city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = (events.where(events.city == "Amsterdam")
    .select(events.timestamp).collect())

[Logical plan: a filter sits on top of a join between the events file and the users table; the join is expensive, so only the relevant users should be joined. Physical plan: the filter is pushed down, so the join combines scan(events) with a filtered scan(users).]
HOW DOES IT ALL WORK?
An example query
SELECT name
FROM (
SELECT id, name
FROM People) p
WHERE p.id = 1
Logical Plan: Project(name) → Filter(id = 1) → Project(id, name) → People
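You can inspect these plans yourself; a quick PySpark sketch that creates a throwaway People table just for the demonstration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

spark.range(100).select(
    F.col("id"),
    F.concat(F.lit("name_"), F.col("id").cast("string")).alias("name"),
).createOrReplaceTempView("People")

spark.sql("""
    SELECT name
    FROM (SELECT id, name FROM People) p
    WHERE p.id = 1
""").explain(True)   # prints the parsed, analyzed, optimized logical and physical plans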
Naïve Query Planning
SELECT name
FROM (
SELECT id, name
FROM People) p
WHERE p.id = 1
Logical Plan: Project(name) → Filter(id = 1) → Project(id, name) → People
Physical Plan: Project(name) → Filter(id = 1) → Project(id, name) → TableScan(People)
Optimized Execution
■ Writing imperative code to optimize all
possible patterns is hard.
■ Instead write simple rules:
– Each rule makes one change
– Run many rules together to fixed
point.
Logical Plan: Project(name) → Filter(id = 1) → Project(id, name) → People
Physical Plan: IndexLookup(id = 1, return: name)
Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be evaluated
without the result of the project.
3. If so, switch the operators.
Original Plan: Project(name) → Filter(id = 1) → Project(id, name) → People
Filter Push-Down: Project(name) → Project(id, name) → Filter(id = 1) → People
Optimizing with Rules
Original Plan: Project(name) → Filter(id = 1) → Project(id, name) → People
Filter Push-Down: Project(name) → Project(id, name) → Filter(id = 1) → People
Combine Projection: Project(name) → Filter(id = 1) → People
Physical Plan: IndexLookup(id = 1, return: name)
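A toy, Catalyst-flavoured sketch in Python (not Catalyst itself, whose rules are written in Scala over its own plan trees) showing a push-down rule and a projection-collapsing rule applied until a fixed point:

from dataclasses import dataclass
from typing import Tuple

# Toy logical operators, standing in for Catalyst's tree of plan nodes.
@dataclass
class Plan:
    pass

@dataclass
class Scan(Plan):
    table: str

@dataclass
class Project(Plan):
    cols: Tuple[str, ...]
    child: Plan

@dataclass
class Filter(Plan):
    cond: str
    child: Plan

def push_filter_below_project(p: Plan) -> Plan:
    # One rule, one change: Filter(Project(x)) -> Project(Filter(x)),
    # assuming the predicate only references projected columns.
    if isinstance(p, Filter) and isinstance(p.child, Project):
        return Project(p.child.cols, Filter(p.cond, p.child.child))
    return p

def collapse_projects(p: Plan) -> Plan:
    # Adjacent projections combine when the outer columns are a subset
    # of the inner ones (assumed here for simplicity).
    if isinstance(p, Project) and isinstance(p.child, Project):
        return Project(p.cols, p.child.child)
    return p

def apply_rules(p: Plan) -> Plan:
    # Rewrite children first, then this node, repeating until a fixed point.
    if isinstance(p, Project):
        p = Project(p.cols, apply_rules(p.child))
    elif isinstance(p, Filter):
        p = Filter(p.cond, apply_rules(p.child))
    new = collapse_projects(push_filter_below_project(p))
    return new if new == p else apply_rules(new)

plan = Project(("name",), Filter("id = 1", Project(("id", "name"), Scan("People"))))
print(apply_rules(plan))
# Project(cols=('name',), child=Filter(cond='id = 1', child=Scan(table='People')))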
SPARKLYR
R interface for Apache Spark
Sparklyr
■ First released in September 2016.
■ Connect to Spark from R.
■ Provides a complete dplyr backend.
■ Filter and aggregate Spark datasets
then bring them into R for analysis and
visualization.
■ Use Spark's distributed machine learning library from R.
Manipulating Data with dplyr
■ dplyr is an R package for working with structured data both in and outside of R.
■ dplyr makes data manipulation for R users easy, consistent, and performant.
■ With dplyr as an interface to manipulating Spark DataFrames, you can:
– Select, filter, and aggregate data.
– Use window functions (e.g. for sampling).
– Perform joins on DataFrames.
– Collect data from Spark into R.
Reading Data
■ You can read data into Spark DataFrames using the following functions:
– spark_read_csv.
– spark_read_json.
– spark_read_parquet.
■ Regardless of the format of your data, Spark supports reading data from a variety of
different data sources. These include data stored on HDFS, Amazon S3, or local files
available to the spark worker nodes.
■ Each of these functions returns a reference to a Spark DataFrame which can be
used as a dplyr table.
Flights Data
■ This guide will demonstrate some of the basic data manipulation verbs of dplyr by
using data from the nycflights13 R package.
■ This package contains data for all 336,776 flights departing New York City in 2013.
It also includes useful metadata on airlines, airports, weather, and planes.
■ Connect to the cluster and copy the flights data using the copy_to function.
■ Note:
– The flight data in nycflights13 is convenient for dplyr demonstrations because it
is small, but in practice large data should rarely be copied directly from R
objects.
Flights Data
library(sparklyr)
library(dplyr)
library(nycflights13)
library(ggplot2)
sc <- spark_connect(master="local")
flights <- copy_to(sc, flights, "flights")
airlines <- copy_to(sc, airlines, "airlines")
src_tbls(sc)
[1] "airlines" "flights"
dplyr Verbs
■ Verbs are dplyr commands for manipulating data.
■ When connected to a Spark DataFrame, dplyr translates the commands into Spark
SQL statements.
■ Remote data sources use exactly the same five verbs as local data sources.
■ Here are the five verbs with their corresponding SQL commands:
– Select ~ select
– Filter ~ where
– Arrange ~ order
– Summarise ~ aggregators: sum, min, sd, etc.
– Mutate ~ operators: +, *, log, etc.
dplyr Verbs:
select
select(flights, year:day, arr_delay,
dep_delay)
# Source: lazy query [?? x 5]
# Database: spark_connection
year month day arr_delay dep_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 11 2
2 2013 1 1 20 4
3 2013 1 1 33 2
4 2013 1 1 -18 -1
5 2013 1 1 -25 -6
6 2013 1 1 12 -4
7 2013 1 1 19 -5
8 2013 1 1 -14 -3
9 2013 1 1 -8 -3
10 2013 1 1 8 -2
# ... with 3.368e+05 more rows
dplyr Verbs:
filter
filter(flights, dep_delay > 1000) # Source: lazy query [?? x 19]
# Database: spark_connection
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 9 641 900 1301
2 2013 1 10 1121 1635 1126
3 2013 6 15 1432 1935 1137
4 2013 7 22 845 1600 1005
5 2013 9 20 1139 1845 1014
# ... with 13 more variables: arr_time <int>, sched_arr_time
# <int>, arr_delay <dbl>, carrier <chr>, flight <int>, tailnum
# <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance
# <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>
dplyr Verbs:
arrange
arrange(flights, desc(dep_delay)) # Source: table<flights> [?? x 19]
# Database: spark_connection
# Ordered by: desc(dep_delay)
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 9 641 900 1301
2 2013 6 15 1432 1935 1137
3 2013 1 10 1121 1635 1126
4 2013 9 20 1139 1845 1014
5 2013 7 22 845 1600 1005
6 2013 4 10 1100 1900 960
7 2013 3 17 2321 810 911
8 2013 6 27 959 1900 899
9 2013 7 22 2257 759 898
10 2013 12 5 756 1700 896
# ... With 3.368e+05 more rows, and 13 more variables:
# arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest
# <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute
# <dbl>, time_hour <dbl>
dplyr Verbs:
summarise
summarise(flights, mean_dep_delay =
mean(dep_delay))
# Source: lazy query [?? x 1]
# Database: spark_connection
mean_dep_delay
<dbl>
1 12.63907
dplyr Verbs:
mutate
mutate(flights, speed = distance /
air_time * 60)
Source: query [3.368e+05 x 4]
Database: spark connection master=local[4]
app=sparklyr local=TRUE
# A tibble: 3.368e+05 x 4
year month day speed
<int> <int> <int> <dbl>
1 2013 1 1 370.0441
2 2013 1 1 374.2731
3 2013 1 1 408.3750
4 2013 1 1 516.7213
5 2013 1 1 394.1379
6 2013 1 1 287.6000
7 2013 1 1 404.4304
8 2013 1 1 259.2453
9 2013 1 1 404.5714
10 2013 1 1 318.6957
# ... with 3.368e+05 more rows
Laziness
■ When working with databases, dplyr tries to be as lazy as possible:
– It never pulls data into R unless you explicitly ask for it.
– It delays doing any work until the last possible moment: it collects together
everything you want to do and then sends it to the database in one step.
■ For example, take the following code:
– c1 <- filter(flights, day == 17, month == 5, carrier %in% c('UA', 'WN', 'AA', 'DL'))
– c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
– c3 <- arrange(c2, year, month, day, carrier)
– c4 <- mutate(c3, air_time_hours = air_time / 60)
Laziness
■ This sequence of operations never
actually touches the database.
■ It’s not until you ask for the data (e.g.
by printing c4) that dplyr requests the
results from the database.
# Source: lazy query [?? x 20]
# Database: spark_connection
year month day carrier dep_delay air_time distance air_time_hours
<int> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 5 17 AA -2 294 2248 4.900000
2 2013 5 17 AA -1 146 1096 2.433333
3 2013 5 17 AA -2 185 1372 3.083333
4 2013 5 17 AA -9 186 1389 3.100000
5 2013 5 17 AA 2 147 1096 2.450000
6 2013 5 17 AA -4 114 733 1.900000
7 2013 5 17 AA -7 117 733 1.950000
8 2013 5 17 AA -7 142 1089 2.366667
9 2013 5 17 AA -6 148 1089 2.466667
10 2013 5 17 AA -7 137 944 2.283333
# ... with more rows
Piping
■ You can use magrittr pipes to write cleaner syntax. Using the same example from
above, you can write a much cleaner version like this:
– c4 <- flights %>%
filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
select(carrier, dep_delay, air_time, distance) %>%
arrange(carrier) %>%
mutate(air_time_hours = air_time / 60)
Grouping
c4 %>%
group_by(carrier) %>%
summarize(count = n(), mean_dep_delay =
mean(dep_delay))
Source: query [?? x 3]
Database: spark connection master=local
app=sparklyr local=TRUE
# S3: tbl_spark
carrier count mean_dep_delay
<chr> <dbl> <dbl>
1 AA 94 1.468085
2 UA 172 9.633721
3 WN 34 7.970588
4 DL 136 6.235294
Collecting to R
■ You can copy data from Spark into R’s memory by using collect().
■ collect() executes the Spark query and returns the results to R for further analysis
and visualization.
Collecting to R
carrierhours <- collect(c4)
# Test the significance of pairwise
differences and plot the results
with(carrierhours, pairwise.t.test(air_time,
carrier))
Pairwise comparisons using t tests with
pooled SD
data: air_time and carrier
AA DL UA
DL 0.25057 - -
UA 0.07957 0.00044 -
WN 0.07957 0.23488 0.00041
P value adjustment method: holm
Collecting to R
ggplot(carrierhours, aes(carrier,
air_time_hours)) + geom_boxplot()
Window Functions
■ dplyr supports Spark SQL window functions.
■ Window functions are used in conjunction with mutate and filter to solve a wide
range of problems.
■ You can compare the dplyr syntax to the query it has generated by using
sql_render().
Window Functions
# Rank each flight within each day
ranked <- flights %>%
group_by(year, month, day) %>%
select(dep_delay) %>%
mutate(rank = rank(desc(dep_delay)))
sql_render(ranked)
<SQL> SELECT `year`, `month`, `day`,
`dep_delay`, rank() OVER (PARTITION BY
`year`, `month`, `day` ORDER BY
`dep_delay` DESC) AS `rank`
FROM (SELECT `year` AS `year`, `month`
AS `month`, `day` AS `day`, `dep_delay`
AS `dep_delay` FROM `flights`)
`uflidyrkpj`
Window Functions
ranked # Source: lazy query [?? x 20]
# Database: spark_connection
year month day dep_delay rank
<int> <int> <int> <dbl> <int>
1 2013 1 5 327 1
2 2013 1 5 257 2
3 2013 1 5 225 3
4 2013 1 5 128 4
5 2013 1 5 127 5
6 2013 1 5 117 6
7 2013 1 5 111 7
8 2013 1 5 108 8
9 2013 1 5 105 9
10 2013 1 5 101 10
# ... with more rows
Performing Joins
■ It’s rare that a data analysis involves only a single table of data.
■ In practice, you’ll normally have many tables that contribute to an analysis, and you need
flexible tools to combine them.
■ In dplyr, there are three families of verbs that work with two tables at a time:
– Mutating joins, which add new variables to one table from matching rows in
another.
– Filtering joins, which filter observations from one table based on whether or not
they match an observation in the other table.
– Set operations, which combine the observations in the data sets as if they were set
elements.
■ All two-table verbs work similarly. The first two arguments are x and y, and provide the
tables to combine. The output is always a new table with the same type as x.
Performing Joins
The following statements are equivalent:
flights %>% left_join(airlines)
flights %>% left_join(airlines, by =
"carrier")
# Source: lazy query [?? x 20]
# Database: spark_connection
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ... with more rows
Sampling
■ You can use sample_n() and sample_frac() to take a random sample of rows:
– use sample_n() for a fixed number and sample_frac() for a fixed fraction.
■ Ex:
– sample_n(flights, 10)
– sample_frac(flights, 0.01)
SPARK SQL EXAMPLE
Analysis of babynames with dplyr
Analysis of babynames with dplyr
1. Setup.
2. Connect to Spark.
3. Total US births.
4. Aggregate data by name.
5. Most popular names (1986).
6. Most popular names (2014).
7. Shared names.
References
1. https://spark.apache.org/docs/latest/sql-programming-guide.html
2. http://spark.rstudio.com/
3. https://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune
4. http://www.tutorialspoint.com/spark_sql/
5. https://www.youtube.com/watch?v=A7Ef_ZB884g
THANK YOU
