Building a modern Application
w/ DataFrames
Meetup @ [24]7 in Campbell, CA
Sept 8, 2015
Who am I?
Sameer
Farooqui
• Trainer @ Databricks
• 150+ trainings on Hadoop, C*,
HBase, Couchbase, NoSQL, etc
Google: “spark newcircle foundations” / code: SPARK-MEETUPS-15
Who are you?
1) I have used Spark hands on before…
2) I have used DataFrames before (in any language)…
Agenda
• Be able to smartly use DataFrames tomorrow!
• Spark Overview
• DataFrames (10 mins): Intro + Advanced
• Catalyst Internals
• Demo!
The Databricks team contributed more than 75% of the code added to Spark in the
past year
6
[Diagram: the Spark stack — Spark Core with the RDD API and DataFrames API, plus Spark SQL, Spark Streaming, MLlib, and GraphX, reading from data sources such as {JSON}]
7
Goal: unified engine across data sources,
workloads and environments
Spark – 100% open source and mature
Used in production by over 500 organizations, from Fortune 100 companies to small innovators
[Chart: contributors per month to Spark, 2011–2015]
Most active project in big data
9
2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500+ active production deployments
10
Large-Scale Usage
Largest cluster: 8000 nodes
Largest single job: 1 petabyte
Top streaming intake: 1 TB/hour
2014 on-disk 100 TB sort record
12
On-Disk Sort Record: time to sort 100 TB
2013 record (Hadoop): 2100 machines, 72 minutes
2014 record (Spark): 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
Spark Physical Cluster
[Diagram: one Spark Driver (JVM) coordinating four Executors (each its own JVM), with each Executor running multiple Tasks]
Spark Data Model
logLinesRDD — an RDD with 4 partitions, holding log records such as:
(Error, ts, msg1), (Warn, ts, msg2), (Error, ts, msg1),
(Info, ts, msg8), (Warn, ts, msg2), (Info, ts, msg8), (Error, ts, msg3),
(Info, ts, msg5), (Info, ts, msg5),
(Error, ts, msg4), (Warn, ts, msg9), (Error, ts, msg1)
Spark Data Model
[Diagram: an RDD of items 1–10 whose partitions are spread across Executors]
more partitions = more parallelism
16
DataFrame APIs
Spark Data Model
logLinesDF — a DataFrame with 4 partitions. Schema: Type (Str), Time (Int), Msg (Str).
Partition 1: (Error, ts, msg1), (Warn, ts, msg2), (Error, ts, msg1)
Partition 2: (Info, ts, msg7), (Warn, ts, msg2), (Error, ts, msg9)
Partition 3: (Warn, ts, msg0), (Warn, ts, msg2), (Info, ts, msg11)
Partition 4: (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg1)
df.rdd.partitions.size = 4
Spark Data Model
[Diagram: a DataFrame whose partitions are spread across Executors]
more partitions = more parallelism
19
DataFrame Benefits
• Easier to program
• Significantly fewer Lines of Code
• Improved performance
• via intelligent optimizations and code-generation
Write Less Code: Compute an Average
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
20
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
sqlCtx.table("people") 
.groupBy("name") 
.agg("name", avg("age")) 
.collect()
Full API Docs
• Python
• Scala
• Java
• R
21
22
DataFrames are evaluated lazily
[Diagram: a lineage of DataFrames (DF-1 → DF-2 → DF-3) defined on top of distributed storage; nothing is materialized yet]
23
DataFrames are evaluated lazily
[Diagram: when an action is called, Catalyst builds the plan and executes the DAG against distributed storage]
24
DataFrames are evaluated lazily
- -
-E T
ME T
M
- -
-E T
ME T
M
- -
-E T
ME T
M
DF-
1
- -
E T
E T
- -
E T
E T
- -
E T
E T
DF-
2
- -
E T
E T
- -
E T
E T
- -
E T
E T
DF-
3
Distributed
Storage
or
Transformations, Actions, Laziness
Transformation examples: filter, select, drop, intersect, join
Action examples: count, collect, show, head, take
DataFrames are lazy. Transformations contribute to the query plan, but they don't execute anything.
Actions cause the execution of the query, as the sketch below shows.
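A minimal PySpark sketch of this laziness, assuming the logLinesDF DataFrame from earlier (with columns Type, Time, Msg):

errorsDF = logLinesDF.filter(logLinesDF["Type"] == "Error") \
                     .select("Time", "Msg")   # transformations: no job runs yet
errorsDF.count()                              # action: triggers execution
errorsDF.show(5)                              # another action: executes the plan again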
3 fundamental transformations on DataFrames:
- mapPartitions()
- new ShuffledRDD
- zipPartitions()
DataFrames graduated from Alpha in 1.3
Spark SQL – part of the core distribution since Spark 1.0 (April 2014)
[Charts: # of commits per month and # of contributors to Spark SQL]
27
28
Which context?
SQLContext
• Basic functionality
HiveContext
• More advanced; a superset of SQLContext
• More complete HiveQL parser
• Can read from the Hive metastore and Hive tables
• Access to Hive UDFs
(Improved multi-version Hive support in 1.4)
A minimal sketch of creating both contexts follows below.
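As a minimal sketch (Spark 1.x APIs, assuming an existing SparkContext named sc):

from pyspark.sql import SQLContext, HiveContext

sqlContext = SQLContext(sc)    # basic functionality
hiveContext = HiveContext(sc)  # superset: fuller HiveQL parser, Hive metastore/tables, Hive UDFs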
Construct a DataFrame
29
# Construct a DataFrame from a "users" table in Hive.
df = sqlContext.read.table("users")

# Construct a DataFrame from a log file in S3.
df = sqlContext.read.json("s3n://someBucket/path/to/data.json")

// Scala
val people = sqlContext.read.parquet("...")

// Java
DataFrame people = sqlContext.read().parquet("...");
Use DataFrames
30
# Create a new DataFrame that contains only "young" users
young = users.filter(users["age"] < 21)
# Alternatively, using a Pandas-like syntax
young = users[users.age < 21]
# Increment everybody's age by 1
young.select(young["name"], young["age"] + 1)
# Count the number of young users by gender
young.groupBy("gender").count()
# Join young users with another DataFrame, logs
young.join(logs, logs["userId"] == young["userId"], "left_outer")
DataFrames and Spark SQL
31
young.registerTempTable("young")
sqlContext.sql("SELECT count(*) FROM young")
Actions on a DataFrame
Functions on a DataFrame
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Queries on a DataFrame
Operations on a DataFrame
Creating DataFrames
[Diagram: an RDD of (E, T, M) records being converted into a DataFrame (DF) with columns E, T, M]
Data Sources
39
Data Sources API
• Provides a pluggable mechanism for accessing structured data
through Spark SQL
• Tight optimizer integration means filtering and column pruning
can often be pushed all the way down to data sources
• Supports mounting external sources as temp tables (see the sketch below)
• Introduced in Spark 1.2 via SPARK-3247
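A hedged sketch of mounting an external source as a temp table, assuming the external spark-csv package and a hypothetical people.csv file:

sqlContext.sql("""
  CREATE TEMPORARY TABLE people
  USING com.databricks.spark.csv
  OPTIONS (path "people.csv", header "true")
""")
sqlContext.sql("SELECT * FROM people LIMIT 5").show()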
40
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write
DataFrames
using a variety of formats.
40
{ JSON }
Built-In External
JDBC
and more…
Find more sources at http://spark-packages.org
41
Spark Packages
Supported Data Sources:
• Avro
• Redshift
• CSV
• MongoDB
• Cassandra
• Cloudant
• Couchbase
• ElasticSearch
• Mainframes (IBM z/OS)
• Many More!
42
DataFrames: Reading from JDBC
1.3
• Supports any JDBC-compatible RDBMS: MySQL, Postgres, H2, etc.
• Unlike the pure RDD implementation (JdbcRDD), this supports predicate pushdown and auto-converts the data into a DataFrame
• Since you get a DataFrame back, it's usable from Java/Python/R/Scala
• The JDBC server allows multiple users to share one Spark cluster
A hedged example of a JDBC read follows below.
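A minimal sketch of a JDBC read (Spark 1.4+ DataFrameReader); the URL, table name, and credentials below are hypothetical placeholders:

df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://dbhost:3306/mydb",
    dbtable="people",
    user="readonly",
    password="secret"
).load()
df.filter(df["age"] > 21).count()   # simple predicates can be pushed down to the database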
Read Less Data
The fastest way to process big data is to never read it.
Spark SQL can help you read less data automatically:
• Converting to more efficient formats
• Using columnar formats (e.g., Parquet)
• Using partitioning (e.g., /year=2014/month=02/…)¹
• Skipping data using statistics (e.g., min, max)²
• Pushing predicates into storage systems (e.g., JDBC)
¹ Only supported for Parquet and Hive; more support coming in Spark 1.4
² Turned off by default in Spark 1.3
Fall 2012: &
July 2013: 1.0 release
May 2014: Apache Incubator, 40+
contributors
Parquet (columnar format) benefits:
• Limits I/O: scans/reads only the columns that are needed
• Saves space: columnar layout compresses better
[Diagram: logical table representation vs. row layout vs. column layout — source: parquet.apache.org]
Reading:
• Readers first read the file metadata to find all the column chunks they are interested in.
• The column chunks should then be read sequentially.
Writing:
• Metadata is written after the data to allow for single-pass writing.
Parquet Features
1. Metadata merging
• Allows developers to easily add/remove columns in data files
• Spark will scan the metadata for all files and merge the schemas
2. Auto-discovery of data that has been partitioned into folders
• Spark then prunes which folders are scanned based on predicates
So you can greatly speed up queries simply by breaking up data into folders, as the sketch below shows:
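A minimal sketch of partition discovery and pruning, assuming a hypothetical /data/events directory laid out as /data/events/year=2014/month=02/part-*.parquet:

df = sqlContext.read.parquet("/data/events")                    # year and month are discovered as columns
df.filter((df["year"] == 2014) & (df["month"] == 2)).count()    # only matching folders are scanned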
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
47
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
read and write
functions create new
builders for doing I/O
48
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
Builder methods
specify:
• Format
• Partitioning
• Handling of
existing data
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
49
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
load(…), save(…)
or saveAsTable(…)
finish the I/O
specification
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
50
51
How are statistics used to improve DataFrame performance?
• Statistics are logged when caching
• During reads, these statistics can be used to skip some cached partitions
• InMemoryColumnarTableScan can now skip partitions that cannot possibly contain any matching rows
[Diagram: a cached DataFrame with three partitions whose statistics are max(a)=9, max(a)=7, and max(a)=8; for the predicate a = 8, any partition with max(a) < 8 can be skipped]
Reference:
• https://github.com/apache/spark/pull/1883
• https://github.com/apache/spark/pull/2188
Filters supported: =, <, <=, >, >=
A minimal sketch follows below.
DataFrame: # of partitions after a shuffle
[Diagram: DF-1 is shuffled into DF-2; the number of post-shuffle partitions is configurable]
sqlContext.setConf(key, value)
spark.sql.shuffle.partitions defaults to 200
Spark 1.6: adaptive shuffle
A minimal sketch follows below.
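A minimal sketch of tuning post-shuffle parallelism, assuming a hypothetical DataFrame df with a Type column:

sqlContext.setConf("spark.sql.shuffle.partitions", "8")   # default is 200
counts = df.groupBy("Type").count()       # the groupBy causes a shuffle
print(counts.rdd.getNumPartitions())      # -> 8 partitions after the shuffle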
Caching a DataFrame
[Diagram: the partitions of DF-1 cached in memory across Executors]
Spark SQL will re-encode the data into byte buffers before caching it, so that there is less pressure on the GC.
.cache()
Demo!
Schema Inference
What if your data file doesn’t have a schema? (e.g., You’re reading a
CSV file or a plain text file.)
You can create an RDD of a particular type and let Spark infer the
schema from that type. We’ll see how to do that in a moment.
You can use the API to specify the schema programmatically.
(It’s better to use a schema-oriented input source if you can, though.)
Schema Inference Example
Suppose you have a (text) file that looks like
this:
56
The file has no schema,
but it’s obvious there is
one:
First name: string
Last name: string
Gender: string
Age: integer
Let’s see how to get Spark to infer the schema.
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
…
Schema Inference :: Scala
57
import sqlContext.implicits._
case class Person(firstName: String,
lastName: String,
gender: String,
age: Int)
val rdd = sc.textFile("people.csv")
val peopleRDD = rdd.map { line =>
val cols = line.split(",")
Person(cols(0), cols(1), cols(2), cols(3).toInt)
}
val df = peopleRDD.toDF
// df: DataFrame = [firstName: string, lastName: string, gender: string, age: int]
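A hedged PySpark equivalent of the Scala example above, using Row objects so Spark can infer the schema (same hypothetical people.csv):

from pyspark.sql import Row

rdd = sc.textFile("people.csv")
peopleRDD = rdd.map(lambda line: line.split(",")) \
               .map(lambda c: Row(firstName=c[0], lastName=c[1],
                                  gender=c[2], age=int(c[3])))
df = sqlContext.createDataFrame(peopleRDD)   # schema is inferred from the Row fields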
A brief look at spark-csv
Let’s assume our data file has a header:
58
first_name,last_name,gender,age
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
…
A brief look at spark-csv
With spark-csv, we can simply create a DataFrame
directly from our CSV file.
59
// Scala
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
load("people.csv")
# Python
df = sqlContext.read.format("com.databricks.spark.csv") \
    .load("people.csv", header="true")
60
DataFrames: Under the hood
[Diagram: SQL AST or DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]
DataFrames and SQL share the same optimization/execution pipeline
61
DataFrames: Under the hood
[Diagram: the same pipeline, highlighting where Catalyst optimizations happen]
Catalyst optimizations
Logical optimizations:
• Push filter predicates down to the data source, so irrelevant data can be skipped
• Parquet: skip entire blocks, turn comparisons on strings into cheaper integer comparisons via dictionary encoding
• RDBMS: reduce the amount of data traffic by pushing predicates down
Create the physical plan & generate JVM bytecode:
• Catalyst compiles operations into physical plans for execution and generates JVM bytecode
• Intelligently choose between broadcast joins and shuffle joins to reduce network traffic
• Lower-level optimizations: eliminate expensive object allocations and reduce virtual function calls
Not Just Less Code: Faster Implementations
63
[Chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
https://gist.github.com/rxin/c1592c133e4bccf515dd
Catalyst Goals
64
1) Make it easy to add new optimization techniques and features
to Spark SQL
2) Enable developers to extend the optimizer
• For example, to add data source specific rules that can push filtering
or aggregation into external storage systems
• Or to support new data types
Catalyst: Trees
65
• Tree: Main data type in Catalyst
• Tree is made of node objects
• Each node has type and 0 or
more children
• New node types are defined as
subclasses of TreeNode class
• Nodes are immutable and are
manipulated via functional
transformations
Imagine we have the following three node classes for a very simple expression language:
• Literal(value: Int): a constant value
• Attribute(name: String): an attribute from an input row, e.g., "x"
• Add(left: TreeNode, right: TreeNode): the sum of two expressions
Build a tree for the expression x + (1+2). In Scala code: Add(Attribute(x), Add(Literal(1), Literal(2)))
Catalyst: Rules
66
• Rules: Trees are manipulated
using rules
• A rule is a function from a tree to
another tree
• Commonly, Catalyst will use a set
of pattern matching functions to
find and replace subtrees
• Trees offer a transform method
that applies a pattern matching
function recursively on all nodes
of the tree, transforming the ones
that match each pattern to a
result
Let's implement a rule that folds Add operations between constants:

tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
}

Apply this to the tree: x + (1+2)
Yields: x + 3
• The rule may only match a subset of all possible input trees
• Catalyst tests which parts of a tree a given rule may apply to,
and skips over or descends into subtrees that do not match
• Rules don’t need to be modified as new types of operators are
added
Catalyst: Rules
67
Rules can match multiple patterns in the same transform call:

tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
  case Add(left, Literal(0)) => left
  case Add(Literal(0), right) => right
}

Apply this to the tree: x + (1+2)
Still yields: x + 3
Apply this to the tree: (x+0) + (3+3)
Now yields: x + 6
Catalyst: Rules
68
• Rules may need to execute multiple times to fully transform a tree
  Example: constant-folding larger trees
• Rules are grouped into batches; each batch is executed to a fixed point (until the tree stops changing)
  Example: a first batch analyzes an expression to assign types to all attributes; a second batch uses the new types to do constant folding
• Rule conditions and their bodies contain arbitrary Scala code
• Takeaway: functional transformations on immutable trees (easy to reason about & debug)
• Coming soon: enable parallelization in the optimizer
• Coming soon: Enable parallelization in the optimizer
69
Using Catalyst in Spark SQL
[Diagram: SQL AST or DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]
Analysis: analyzing a logical plan to resolve references
Logical Optimization: optimizing the logical plan
Physical Planning: generating physical plans
Code Generation: compiling parts of the query to Java bytecode
Catalyst: Analysis
[Diagram: Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan]
• A relation may contain unresolved attribute references or relations
• Example: "SELECT col FROM sales"
  • The type of col is unknown
  • Even whether col is a valid column name is unknown (until we look up the table)
Catalyst: Analysis
[Diagram: Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan]
• An attribute is unresolved if:
  • Catalyst doesn't know its type
  • Catalyst has not matched it to an input table
• Catalyst will use rules and a Catalog object (which tracks the tables in all data sources) to resolve these attributes
Step 1: Build the "unresolved logical plan"
Step 2: Apply rules
Analysis rules:
• Look up relations by name in the Catalog
• Map named attributes (like col) to the input
• Determine which attributes refer to the same value, to give them a unique ID (for later optimizations)
• Propagate and coerce types through expressions
  • We can't know the return type of 1 + col until we have resolved col
Catalyst: Analyzer.scala
https://github.com/apache/spark/blob/fedbfc7074dd6d38dc5301d66d1ca097bc2a21e0/sql/catalyst/
src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
< 500 lines of code
Catalyst: Logical Optimizations
73
[Diagram: Logical Plan → Logical Optimization → Optimized Logical Plan]
• Applies rule-based optimizations to the logical plan:
  • Constant folding
  • Predicate pushdown
  • Projection pruning
  • Null propagation
  • Boolean expression simplification
  • [Others]
• Example: a 12-line rule optimizes LIKE expressions with simple regular expressions into String.startsWith or String.contains calls.
Catalyst: Optimizer.scala
https://github.com/apache/spark/blob/fedbfc7074dd6d38dc5301d66d1ca097bc2a21e0/sql/catalyst/
src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
< 700 lines of code
Catalyst: Physical Planning
75
[Diagram: Optimized Logical Plan → Physical Planning → Physical Plans]
• Spark SQL takes a logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine:
  1. mapPartitions()
  2. new ShuffledRDD
  3. zipPartitions()
• Currently, cost-based optimization is only used to select a join algorithm: broadcast join vs. traditional (shuffle) join
• The physical planner also performs rule-based physical optimizations, like pipelining projections or filters into one Spark map operation
• It can also push operations from the logical plan into data sources (predicate pushdown)
Catalyst: SparkStrategies.scala
https://github.com/apache/spark/blob/fedbfc7074dd6d38dc5301d66d1ca097bc2a21e0/sql/core/src
/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
< 400 lines of code
Catalyst: Code Generation
77
• Generates Java bytecode to run on each machine
• Catalyst relies on janino to make code generation simple (FYI: it used quasiquotes earlier, but now uses janino)
[Diagram: Selected Physical Plan → Code Generation → RDDs]
This code-gen function converts an expression like (x+y) + 1 into a Scala AST:
Catalyst: CodeGenerator.scala
https://github.com/apache/spark/blob/fedbfc7074dd6d38dc5301d66d1ca097bc2a21e0/sql/catalyst/
src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
< 700 lines of code
Seamlessly Integrated
Intermix DataFrame operations with
custom Python, Java, R, or Scala code
zipToCity = udf(lambda zipCode: <custom logic here>)

def add_demographics(events):
    u = sqlCtx.table("users")
    return events \
        .join(u, events.user_id == u.user_id) \
        .withColumn("city", zipToCity(events.zip))

Augments any DataFrame that contains user_id
79
Optimize Entire Pipelines
Optimization happens as late as possible, therefore
Spark SQL can optimize even across functions.
80
events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = events \
    .where(events.city == "San Francisco") \
    .select(events.timestamp) \
    .collect()
81
def add_demographics(events):
    u = sqlCtx.table("users")                         # Load Hive table
    return (events
            .join(u, events.user_id == u.user_id)     # Join on user_id
            .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()

Logical Plan: filter over a join of the events file and the users table (the join is expensive)
Physical Plan: join of scan(events) with a filter over scan(users) — only join the relevant users
81
82
def add_demographics(events):
    u = sqlCtx.table("users")                         # Load partitioned Hive table
    return (events
            .join(u, events.user_id == u.user_id)     # Join on user_id
            .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()

Logical Plan: filter over a join of the events file and the users table
Physical Plan: join of scan(events) with a filter over scan(users)
Optimized Physical Plan (with predicate pushdown and column pruning): join of an optimized scan(events) with an optimized scan(users)
Spark 1.5 – Speed / Robustness
Project Tungsten
– Tightly packed binary structures
– Fully-accounted memory with automatic spilling
– Reduced serialization costs
83
[Chart: average GC time per node (seconds) vs. data set size (relative, 1x–16x), comparing Default, Code Gen, Tungsten on-heap, and Tungsten off-heap]
Spark 1.5 – Improved Function Library
100+ native functions with optimized codegen implementations:
– String manipulation: concat, format_string, lower, lpad
– Date/Time: current_timestamp, date_format, date_add
– Math: sqrt, randn
– Other: monotonicallyIncreasingId, sparkPartitionId
84
# Python
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)

// Scala
import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
Window Functions
Before Spark 1.4, there were 2 kinds of functions in Spark that could return a single value:
• Built-in functions or UDFs (e.g., round)
  • take values from a single row as input and generate a single return value for every input row
• Aggregate functions (e.g., sum or max)
  • operate on a group of rows and calculate a single return value for every group
New with Spark 1.4:
• Window functions (e.g., moving average, cumulative sum)
  • operate on a group of rows while still returning a single value for every input row (see the sketch below)
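A minimal sketch of a window function (Spark 1.4+), computing a 3-row moving average; the sales DataFrame and its columns are hypothetical:

from pyspark.sql.window import Window
from pyspark.sql.functions import avg

w = Window.partitionBy("product").orderBy("date").rowsBetween(-2, 0)
sales.withColumn("moving_avg", avg("revenue").over(w)).show()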
Streaming DataFrames
Umbrella ticket to track what's needed to
make streaming DataFrame a reality:
https://issues.apache.org/jira/browse/SPARK-8360
Weitere ähnliche Inhalte

Was ist angesagt?

Advanced kapacitor
Advanced kapacitorAdvanced kapacitor
Advanced kapacitorInfluxData
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Instana - ClickHouse presentation
Instana - ClickHouse presentationInstana - ClickHouse presentation
Instana - ClickHouse presentationMiel Donkers
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDatabricks
 
Scalabilité de MongoDB
Scalabilité de MongoDBScalabilité de MongoDB
Scalabilité de MongoDBMongoDB
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to CassandraGokhan Atil
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...DataStax
 
Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®confluent
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisKnoldus Inc.
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
RedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedis Labs
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing EcosystemDatabricks
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks EDB
 
Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)
Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)
Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)confluent
 

Was ist angesagt? (20)

Advanced kapacitor
Advanced kapacitorAdvanced kapacitor
Advanced kapacitor
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Mongo indexes
Mongo indexesMongo indexes
Mongo indexes
 
Instana - ClickHouse presentation
Instana - ClickHouse presentationInstana - ClickHouse presentation
Instana - ClickHouse presentation
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 
Scalabilité de MongoDB
Scalabilité de MongoDBScalabilité de MongoDB
Scalabilité de MongoDB
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
RedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory Optimization
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing Ecosystem
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
 
Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)
Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)
Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)
 

Andere mochten auch

Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesDatabricks
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXAndrea Iacono
 
vJUG Getting C C++ performance out of java
vJUG Getting C C++ performance out of javavJUG Getting C C++ performance out of java
vJUG Getting C C++ performance out of javaC24 Technologies
 
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demoDatabricks
 
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to HadoopSuccesses, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to HadoopDataWorks Summit/Hadoop Summit
 
Spark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance TuningSpark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance Tuning晨揚 施
 
Migrating Clinical Data in Various Formats to a Clinical Data Management System
Migrating Clinical Data in Various Formats to a Clinical Data Management SystemMigrating Clinical Data in Various Formats to a Clinical Data Management System
Migrating Clinical Data in Various Formats to a Clinical Data Management SystemPerficient, Inc.
 
Cloudera Impala 1.0
Cloudera Impala 1.0Cloudera Impala 1.0
Cloudera Impala 1.0Minwoo Kim
 
John Davies: "High Performance Java Binary" from JavaZone 2015
John Davies: "High Performance Java Binary" from JavaZone 2015John Davies: "High Performance Java Binary" from JavaZone 2015
John Davies: "High Performance Java Binary" from JavaZone 2015C24 Technologies
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseMo Patel
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
Finding Graph Isomorphisms In GraphX And GraphFrames
Finding Graph Isomorphisms In GraphX And GraphFramesFinding Graph Isomorphisms In GraphX And GraphFrames
Finding Graph Isomorphisms In GraphX And GraphFramesSpark Summit
 

Andere mochten auch (20)

Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
vJUG Getting C C++ performance out of java
vJUG Getting C C++ performance out of javavJUG Getting C C++ performance out of java
vJUG Getting C C++ performance out of java
 
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to HadoopSuccesses, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
 
Big Data Platform Industrialization
Big Data Platform Industrialization Big Data Platform Industrialization
Big Data Platform Industrialization
 
Spark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance TuningSpark + Scikit Learn- Performance Tuning
Spark + Scikit Learn- Performance Tuning
 
Migrating Clinical Data in Various Formats to a Clinical Data Management System
Migrating Clinical Data in Various Formats to a Clinical Data Management SystemMigrating Clinical Data in Various Formats to a Clinical Data Management System
Migrating Clinical Data in Various Formats to a Clinical Data Management System
 
Cloudera Impala 1.0
Cloudera Impala 1.0Cloudera Impala 1.0
Cloudera Impala 1.0
 
John Davies: "High Performance Java Binary" from JavaZone 2015
John Davies: "High Performance Java Binary" from JavaZone 2015John Davies: "High Performance Java Binary" from JavaZone 2015
John Davies: "High Performance Java Binary" from JavaZone 2015
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Spark sql
Spark sqlSpark sql
Spark sql
 
Finding Graph Isomorphisms In GraphX And GraphFrames
Finding Graph Isomorphisms In GraphX And GraphFramesFinding Graph Isomorphisms In GraphX And GraphFrames
Finding Graph Isomorphisms In GraphX And GraphFrames
 

Ähnlich wie Building a modern Application with DataFrames

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
 

Ähnlich wie Building a modern Application with DataFrames (20)

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2

Building a modern Application with DataFrames

  • 1. Building a modern Application w/ DataFrames Meetup @ [24]7 in Campbell, CA Sept 8, 2015
  • 2. Who am I? Sameer Farooqui • Trainer @ Databricks • 150+ trainings on Hadoop, C*, HBase, Couchbase, NoSQL, etc Google: “spark newcircle foundations” / code: SPARK- MEETUPS-15
  • 3. Who are you? 1) I have used Spark hands on before… 2) I have used DataFrames before (in any language)…
  • 4. Agenda • Be able to smartly use DataFrames tomorrow! + Intro + Advanced Demo!• Spark Overview • Catalyst Internals • DataFrames (10 mins)
  • 5. The Databricks team contributed more than 75% of the code added to Spark in the past year
  • 7. 7 Goal: unified engine across data sources, workloads and environments
  • 8. Spark – 100% open source and mature Used in production by over 500 organizations. From fortune 100 to small innovators
  • 9. 0 20 40 60 80 100 120 140 2011 2012 2013 2014 2015 Contributors per Month to Spark Most active project in big data 9
  • 10. 2014: an Amazing Year for Spark Total contributors: 150 => 500 Lines of code: 190K => 370K 500+ active production deployments 10
  • 11. Large-Scale Usage Largest cluster: 8000 nodes Largest single job: 1 petabyte Top streaming intake: 1 TB/hour 2014 on-disk 100 TB sort record
  • 12. 12 On-Disk Sort Record: Time to sort 100TB Source: Daytona GraySort benchmark, sortbenchmark.org 2100 machines2013 Record: Hadoop 72 minutes 2014 Record: Spark 207 machines 23 minutes
  • 13. Spark Driver Executor Task Task Executor Task Task Executor Task Task Executor Task Task Spark Physical Cluster JVM JVM JVM JVM JVM
  • 14. Spark Data Model Error, ts, msg1 Warn, ts, msg2 Error, ts, msg1 RDD with 4 partitions Info, ts, msg8 Warn, ts, msg2 Info, ts, msg8 Error, ts, msg3 Info, ts, msg5 Info, ts, msg5 Error, ts, msg4 Warn, ts, msg9 Error, ts, msg1 logLinesRDD
  • 17. Spark Data Model DataFrame with 4 partitions logLinesDF Type Time Msg (Str ) (Int ) (Str ) Error ts msg1 Warn ts msg2 Error ts msg1 Type Time Msg (Str ) (Int ) (Str ) Info ts msg7 Warn ts msg2 Error ts msg9 Type Time Msg (Str ) (Int ) (Str ) Warn ts msg0 Warn ts msg2 Info ts msg11 Type Time Msg (Str ) (Int ) (Str ) Error ts msg1 Error ts msg3 Error ts msg1 df.rdd.partitions.size = 4
  • 18. Spark Data Model - - - Ex DF DF Ex DF DF Ex DF more partitions = more parallelism E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M DataFrame
  • 19. 19 DataFrame Benefits • Easier to program • Significantly fewer Lines of Code • Improved performance • via intelligent optimizations and code-generation
  • 20. Write Less Code: Compute an Average private IntWritable one = new IntWritable(1) private IntWritable output = new IntWritable() proctected void map( LongWritable key, Text value, Context context) { String[] fields = value.split("t") output.set(Integer.parseInt(fields[1])) context.write(one, output) } IntWritable one = new IntWritable(1) DoubleWritable average = new DoubleWritable() protected void reduce( IntWritable key, Iterable<IntWritable> values, Context context) { int sum = 0 int count = 0 for(IntWritable value : values) { sum += value.get() count++ } average.set(sum / (double) count) context.Write(key, average) } data = sc.textFile(...).split("t") data.map(lambda x: (x[0], [x.[1], 1])) .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) .map(lambda x: [x[0], x[1][0] / x[1][1]]) .collect() 20
  • 21. Write Less Code: Compute an Average Using RDDs data = sc.textFile(...).split("\t") data.map(lambda x: (x[0], [int(x[1]), 1])) .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) .map(lambda x: [x[0], x[1][0] / x[1][1]]) .collect() Using DataFrames sqlCtx.table("people") .groupBy("name") .agg("name", avg("age")) .collect() Full API Docs • Python • Scala • Java • R 21
  • 22. 22 DataFrames are evaluated lazily - - -E T ME T M - - -E T ME T M - - -E T ME T M DF- 1 - - E T E T - - E T E T - - E T E T DF- 2 - - E T E T - - E T E T - - E T E T DF- 3 Distributed Storage or
  • 23. 23 DataFrames are evaluated lazily Distributed Storage or Catalyst + Execute DAG!
  • 24. 24 DataFrames are evaluated lazily - - -E T ME T M - - -E T ME T M - - -E T ME T M DF- 1 - - E T E T - - E T E T - - E T E T DF- 2 - - E T E T - - E T E T - - E T E T DF- 3 Distributed Storage or
  • 25. Transformations, Actions, Laziness Transformation examples: filter, select, drop, intersect, join Action examples: count, collect, show, head, take 25 DataFrames are lazy. Transformations contribute to the query plan, but they don't execute anything. Actions cause the execution of the query.
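    To make the laziness concrete, here is a minimal Scala sketch, assuming an existing SQLContext named sqlContext and a hypothetical JSON log file with Type/Time/Msg columns (the path and column names are placeholders, not from the deck):

    // Transformations are lazy: they only add steps to the logical query plan.
    val logLinesDF = sqlContext.read.json("/path/to/logs.json")
    val errorsDF = logLinesDF
      .filter(logLinesDF("Type") === "Error")
      .select("Time", "Msg")

    // Actions trigger Catalyst to optimize the plan and run it on the cluster.
    errorsDF.count()
    errorsDF.show(5)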
  • 26. 3 Fundamental transformations on DataFrames - mapPartitions() - new ShuffledRDD - zipPartitions()
  • 27. Graduated from Alpha in 1.3 Spark SQL – Part of the core distribution since Spark 1.0 (April 2014) SQL 27 [Charts: # of commits per month and # of contributors over time]
  • 28. 28 Which context? SQLContext • Basic functionality HiveContext • More advanced • Superset of SQLContext • More complete HiveQL parser • Can read from Hive metastore + tables • Access to Hive UDFs Improved multi-version support in 1.4
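    As a small sketch of the two entry points (Spark 1.x APIs, assuming an existing SparkContext sc):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new SQLContext(sc)    // basic functionality
    val hiveContext = new HiveContext(sc)  // superset: HiveQL parser, Hive metastore, Hive UDFs
    // A HiveContext does not require an existing Hive installation.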
  • 29. Construct a DataFrame 29 # Construct a DataFrame from a "users" table in Hive. df = sqlContext.read.table("users") # Construct a DataFrame from a log file in S3. df = sqlContext.read.json("s3n://someBucket/path/to/data.json", "json") val people = sqlContext.read.parquet("...") DataFrame people = sqlContext.read().parquet("...")
  • 30. Use DataFrames 30 # Create a new DataFrame that contains only "young" users young = users.filter(users["age"] < 21) # Alternatively, using a Pandas-like syntax young = users[users.age < 21] # Increment everybody's age by 1 young.select(young["name"], young["age"] + 1) # Count the number of young users by gender young.groupBy("gender").count() # Join young users with another DataFrame, logs young.join(logs, logs["userId"] == young["userId"], "left_outer")
  • 31. DataFrames and Spark SQL 31 young.registerTempTable("young") sqlContext.sql("SELECT count(*) FROM young")
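    The Scala equivalent looks essentially the same; registerTempTable makes the DataFrame visible to SQL without writing anything out:

    young.registerTempTable("young")
    val youngCount = sqlContext.sql("SELECT count(*) AS cnt FROM young")
    youngCount.show()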
  • 32. Actions on a DataFrame
  • 33. Functions on a DataFrame
  • 34. Functions on a DataFrame
  • 36.
  • 37. Operations on a DataFrame
  • 38. Creating DataFrames - - -E T ME T M - - -E T ME T M - - -E T ME T M E, T, M E, T, M RD D E, T, M E, T, M E, T, M E, T, M DF Data Sources
  • 39. 39 Data Sources API • Provides a pluggable mechanism for accessing structured data through Spark SQL • Tight optimizer integration means filtering and column pruning can often be pushed all the way down to data sources • Supports mounting external sources as temp tables • Introduced in Spark 1.2 via SPARK-3247
  • 40. 40 Write Less Code: Input & Output Spark SQL’s Data Source API can read and write DataFrames using a variety of formats. 40 { JSON } Built-In External JDBC and more… Find more sources at http://spark-packages.org
  • 41. 41 Spark Packages Supported Data Sources: • Avro • Redshift • CSV • MongoDB • Cassandra • Cloudant • Couchbase • ElasticSearch • Mainframes (IBM z/OS) • Many More!
  • 42. 42 DataFrames: Reading from JDBC 1.3 • Supports any JDBC compatible RDBMS: MySQL, PostGres, H2, etc • Unlike the pure RDD implementation (JdbcRDD), this supports predicate pushdown and auto-converts the data into a DataFrame • Since you get a DataFrame back, it’s usable in Java/Python/R/Scala. • JDBC server allows multiple users to share one Spark cluster
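    A hedged sketch of reading over JDBC with the Spark 1.4+ DataFrameReader; the URL, table name, and credentials below are placeholders:

    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("dbtable", "people")
      .option("user", "reader")
      .option("password", "secret")
      .load()

    // Simple predicates on jdbcDF can be pushed down to the database as SQL.
    jdbcDF.filter(jdbcDF("age") > 21).count()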
  • 43. Read Less Data The fastest way to process big data is to never read it. Spark SQL can help you read less data automatically: 1Only supported for Parquet and Hive, more support coming in Spark 1.4 - 2Turned off by default in Spark 1.3 43 • Converting to more efficient formats • Using columnar formats (i.e. parquet) • Using partitioning (i.e., /year=2014/month=02/…)1 • Skipping data using statistics (i.e., min, max)2 • Pushing predicates into storage systems (i.e., JDBC)
  • 44. Parquet: Fall 2012: created by Twitter & Cloudera; July 2013: 1.0 release; May 2014: entered the Apache Incubator, 40+ contributors • Limits I/O: Scans/Reads only the columns that are needed • Saves Space: Columnar layout compresses better Logical table representation Row Layout Column Layout
  • 45. Source: parquet.apache.org Reading: • Readers first read the file metadata to find all the column chunks they are interested in. • The column chunks should then be read sequentially. Writing: • Metadata is written after the data to allow for single-pass writing.
  • 46. Parquet Features 1. Metadata merging • Allows developers to easily add/remove columns in data files • Spark will scan all metadata for files and merge the schemas 2. Auto-discover data that has been partitioned into folders • And then prune which folders are scanned based on predicates So, you can greatly speed up queries simply by breaking up data into folders:
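    A minimal sketch of folder partitioning with Parquet (Spark 1.4+ writer API); the events DataFrame and its year/month columns are hypothetical:

    // Writing creates folders like /data/events/year=2014/month=2/...
    events.write
      .partitionBy("year", "month")
      .parquet("/data/events")

    // Partition discovery turns the folder names back into columns, and the
    // filter below prunes which folders are scanned at all.
    val feb2014 = sqlContext.read.parquet("/data/events")
      .filter("year = 2014 AND month = 2")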
  • 47. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df = sqlContext.read .format("json") .option("samplingRatio", "0.1") .load("/home/michael/data.json") df.write .format("parquet") .mode("append") .partitionBy("year") .saveAsTable("fasterData") 47
  • 48. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df = sqlContext.read .format("json") .option("samplingRatio", "0.1") .load("/home/michael/data.json") df.write .format("parquet") .mode("append") .partitionBy("year") .saveAsTable("fasterData") read and write functions create new builders for doing I/O 48
  • 49. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: Builder methods specify: • Format • Partitioning • Handling of existing data df = sqlContext.read .format("json") .option("samplingRatio", "0.1") .load("/home/michael/data.json") df.write .format("parquet") .mode("append") .partitionBy("year") .saveAsTable("fasterData") 49
  • 50. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: load(…), save(…) or saveAsTable(…) finish the I/O specification df = sqlContext.read .format("json") .option("samplingRatio", "0.1") .load("/home/michael/data.json") df.write .format("parquet") .mode("append") .partitionBy("year") .saveAsTable("fasterData") 50
  • 51. 51 How are statistics used to improve DataFrames performance? • Statistics are logged when caching • During reads, these statistics can be used to skip some cached partitions • InMemoryColumnarTableScan can now skip partitions that cannot possibly contain any matching rows - - - 9 x x 8 x x - - - 4 x x 7 x x - - - 8 x x 2 x x DF max(a)= 9 max(a)= 7 max(a)= 8 Predicate: a = 8 Reference: • https://github.com/apache/spark/pull/1883 • https://github.com/apache/spark/pull/2188 Filters Supported: • =, <, <=, >, >=
  • 52. DataFrame # of Partitions after Shuffle - - -E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M DF- 1 - - -E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M DF- 2 sqlContext.setConf(key, value) spark.sql.shuffle.partitions defaults to 200 Spark 1.6: Adaptive Shuffle
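    A small sketch of tuning the post-shuffle partition count, assuming any DataFrame df with a Type column:

    // Default is 200; lower it for small data, raise it for very large shuffles.
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")

    // Any shuffling operation (groupBy, join, distinct, ...) now yields 64 partitions.
    val counts = df.groupBy("Type").count()
    counts.rdd.partitions.size   // 64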
  • 53. Caching a DataFrame - - -E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M DF- 1 Spark SQL will re-encode the data into byte buffers before caching so that there is less pressure on the GC. .cache()
  • 54. Demo!
  • 55. Schema Inference What if your data file doesn’t have a schema? (e.g., You’re reading a CSV file or a plain text file.) You can create an RDD of a particular type and let Spark infer the schema from that type. We’ll see how to do that in a moment. You can use the API to specify the schema programmatically. (It’s better to use a schema-oriented input source if you can, though.)
  • 56. Schema Inference Example Suppose you have a (text) file that looks like this: 56 The file has no schema, but it’s obvious there is one: First name:string Last name: string Gender: string Age: integer Let’s see how to get Spark to infer the schema. Erin,Shannon,F,42 Norman,Lockwood,M,81 Miguel,Ruiz,M,64 Rosalita,Ramirez,F,14 Ally,Garcia,F,39 Claire,McBride,F,23 Abigail,Cottrell,F,75 José,Rivera,M,59 Ravi,Dasgupta,M,25 …
  • 57. Schema Inference :: Scala 57 import sqlContext.implicits._ case class Person(firstName: String, lastName: String, gender: String, age: Int) val rdd = sc.textFile("people.csv") val peopleRDD = rdd.map { line => val cols = line.split(",") Person(cols(0), cols(1), cols(2), cols(3).toInt) } val df = peopleRDD.toDF // df: DataFrame = [firstName: string, lastName: string, gender: string, age: int]
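    Slide 55 also mentions specifying the schema programmatically; a hedged sketch of that alternative for the same CSV data (assuming an existing sc and sqlContext):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("firstName", StringType, nullable = true),
      StructField("lastName",  StringType, nullable = true),
      StructField("gender",    StringType, nullable = true),
      StructField("age",       IntegerType, nullable = true)))

    val rowRDD = sc.textFile("people.csv").map(_.split(",")).map { cols =>
      Row(cols(0), cols(1), cols(2), cols(3).trim.toInt)
    }

    val df = sqlContext.createDataFrame(rowRDD, schema)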
  • 58. A brief look at spark-csv Let’s assume our data file has a header: 58 first_name,last_name,gender,age Erin,Shannon,F,42 Norman,Lockwood,M,81 Miguel,Ruiz,M,64 Rosalita,Ramirez,F,14 Ally,Garcia,F,39 Claire,McBride,F,23 Abigail,Cottrell,F,75 José,Rivera,M,59 Ravi,Dasgupta,M,25 …
  • 59. A brief look at spark-csv With spark-csv, we can simply create a DataFrame directly from our CSV file. 59 // Scala val df = sqlContext.read.format("com.databricks.spark.csv"). option("header", "true"). load("people.csv") # Python df = sqlContext.read.format("com.databricks.spark.csv"). load("people.csv", header="true")
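    Recent versions of the spark-csv package can also infer column types from the data; a hedged sketch (the inferSchema option may not exist in very old releases of the package):

    val people = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")   // scan the data to guess column types
      .load("people.csv")
    people.printSchema()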
  • 60. 60 DataFrames: Under the hood SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation Catalog DataFrames and SQL share the same optimization/execution pipeline
  • 61. 61 DataFrames: Under the hood SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan CostModel Physical Plans Catalog DataFrame Operations Selected Physical Plan
  • 62. Catalyst Optimizations Logical Optimizations Create Physical Plan & generate JVM bytecode • Push filter predicates down to data source, so irrelevant data can be skipped • Parquet: skip entire blocks, turn comparisons on strings into cheaper integer comparisons via dictionary encoding • RDBMS: reduce amount of data traffic by pushing predicates down • Catalyst compiles operations into physical plans for execution and generates JVM bytecode • Intelligently choose between broadcast joins and shuffle joins to reduce network traffic • Lower level optimizations: eliminate expensive object allocations and reduce virtual function calls
  • 63. Not Just Less Code: Faster Implementations 63 0 2 4 6 8 10 RDD Scala RDD Python DataFrame Scala DataFrame Python DataFrame R DataFrame SQL Time to Aggregate 10 million int pairs (secs) https://gist.github.com/rxin/c1592c133e4bccf515dd
  • 64. Catalyst Goals 64 1) Make it easy to add new optimization techniques and features to Spark SQL 2) Enable developers to extend the optimizer • For example, to add data source specific rules that can push filtering or aggregation into external storage systems • Or to support new data types
  • 65. Catalyst: Trees 65 • Tree: Main data type in Catalyst • Tree is made of node objects • Each node has type and 0 or more children • New node types are defined as subclasses of TreeNode class • Nodes are immutable and are manipulated via functional transformations • Literal(value: Int): a constant value • Attribute(name: String): an attribute from an input row, e.g.,“x” • Add(left: TreeNode, right: TreeNode): sum of two expressions. Imagine we have the following 3 node classes for a very simple expression language: Build a tree for the expression: x + (1+2) In Scala code: Add(Attribute(x), Add(Literal(1), Literal(2)))
  • 66. Catalyst: Rules 66 • Rules: Trees are manipulated using rules • A rule is a function from a tree to another tree • Commonly, Catalyst will use a set of pattern matching functions to find and replace subtrees • Trees offer a transform method that applies a pattern matching function recursively on all nodes of the tree, transforming the ones that match each pattern to a result tree.transform { case Add(Literal(c1), Literal(c2)) => Literal(c1+c2) } Let’s implement a rule that folds Add operations between constants: Apply this to the tree: x + (1+2) Yields: x + 3 • The rule may only match a subset of all possible input trees • Catalyst tests which parts of a tree a given rule may apply to, and skips over or descends into subtrees that do not match • Rules don’t need to be modified as new types of operators are added
  • 67. Catalyst: Rules 67 tree.transform { case Add(Literal(c1), Literal(c2)) => Literal(c1+c2) case Add(left, Literal(0)) => left case Add(Literal(0), right) => right } Rules can match multiple patterns in the same transform call: Apply this to the tree: x + (1+2) Still yields: x + 3 Apply this to the tree: (x+0) + (3+3) Now yields: x + 6
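    The expression language on slides 65–67 can be modeled as a stand-alone toy in plain Scala; this is only a sketch to show the pattern-matching idea, not Catalyst's actual TreeNode classes:

    sealed trait Expr
    case class Literal(value: Int)          extends Expr
    case class Attribute(name: String)      extends Expr
    case class Add(left: Expr, right: Expr) extends Expr

    // A tiny stand-in for TreeNode.transform: rewrite children bottom-up,
    // then try the rule on the result.
    def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
      val withNewChildren = e match {
        case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
        case leaf      => leaf
      }
      rule.applyOrElse(withNewChildren, identity[Expr])
    }

    val rules: PartialFunction[Expr, Expr] = {
      case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)  // constant folding
      case Add(left, Literal(0))         => left              // x + 0  =>  x
      case Add(Literal(0), right)        => right             // 0 + x  =>  x
    }

    transform(Add(Attribute("x"), Add(Literal(1), Literal(2))))(rules)
    // => Add(Attribute("x"), Literal(3)), i.e. x + 3

    transform(Add(Add(Attribute("x"), Literal(0)), Add(Literal(3), Literal(3))))(rules)
    // => Add(Attribute("x"), Literal(6)), i.e. x + 6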
  • 68. Catalyst: Rules 68 • Rules may need to execute multiple times to fully transform a tree • Rules are grouped into batches • Each batch is executed to a fixed point (until tree stops changing) Example: • Constant fold larger trees Example: • First batch analyzes an expression to assign types to all attributes • Second batch uses the new types to do constant folding • Rule conditions and their bodies contain arbitrary Scala code • Takeaway: Functional transformations on immutable trees (easy to reason & debug) • Coming soon: Enable parallelization in the optimizer
  • 69. 69 Using Catalyst in Spark SQL SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation Catalog Analysis: analyzing a logical plan to resolve references Logical Optimization: logical plan optimization Physical Planning: Physical planning Code Generation: Compile parts of the query to Java bytecode
  • 70. Catalyst: Analysis SQL AST DataFrame Unresolved Logical Plan Logical Plan Analysis Catalog- - - - - - DF • Relation may contain unresolved attribute references or relations • Example: “SELECT col FROM sales” • Type of col is unknown • Even if it’s a valid col name is unknown (till we look up the table)
  • 71. Catalyst: Analysis SQL AST DataFrame Unresolved Logical Plan Logical Plan Analysis Catalog • Attribute is unresolved if: • Catalyst doesn’t know its type • Catalyst has not matched it to an input table • Catalyst will use rules and a Catalog object (which tracks all the tables in all data sources) to resolve these attributes Step 1: Build “unresolved logical plan” Step 2: Apply rules Analysis Rules • Look up relations by name in Catalog • Map named attributes (like col) to the input • Determine which attributes refer to the same value to give them a unique ID (for later optimizations) • Propagate and coerce types through expressions • We can’t know return type of 1 + col until we have resolved col
  • 73. Catalyst: Logical Optimizations 73 Logical Plan Optimized Logical Plan Logical Optimization • Applies rule-based optimizations to the logical plan: • Constant folding • Predicate pushdown • Projection pruning • Null propagation • Boolean expression simplification • [Others] • Example: a 12-line rule optimizes LIKE expressions with simple regular expressions into String.startsWith or String.contains calls.
  • 75. Catalyst: Physical Planning 75 • Spark SQL takes a logical plan and generates one or more physical plans using physical operators that match the Spark Execution engine: 1. mapPartitions() 2. new ShuffledRDD 3. zipPartitions() • Currently cost-based optimization is only used to select a join algorithm • Broadcast join • Traditional join • Physical planner also performs rule-based physical optimizations like pipelining projections or filters into one Spark map operation • It can also push operations from the logical plan into data sources (predicate pushdown) Optimized Logical Plan Physical Planning Physical Plans
  • 77. Catalyst: Code Generation 77 • Generates Java bytecode to run on each machine • Catalyst relies on janino to make code generation simple • (FYI – it used to use quasiquotes, but now uses janino) RDDs Selected Physical Plan Code Generation This code gen function converts an expression like (x+y) + 1 to a Scala AST:
  • 79. Seamlessly Integrated Intermix DataFrame operations with custom Python, Java, R, or Scala code zipToCity = udf(lambda zipCode: <custom logic here>) def add_demographics(events): u = sqlCtx.table("users") events .join(u, events.user_id == u.user_id) .withColumn("city", zipToCity(df.zip)) Augments any DataFrame that contains user_id 79
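    A hedged Scala sketch of the same pattern; the zip-to-city lookup and the users table columns are placeholders:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf

    val zipToCity = udf { (zipCode: String) =>
      Map("95008" -> "Campbell", "94105" -> "San Francisco").getOrElse(zipCode, "unknown")
    }

    def addDemographics(events: DataFrame): DataFrame = {
      val u = sqlContext.table("users")
      events
        .join(u, events("user_id") === u("user_id"))   // augments any DataFrame with a user_id column
        .withColumn("city", zipToCity(u("zip")))
    }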
  • 80. Optimize Entire Pipelines Optimization happens as late as possible, therefore Spark SQL can optimize even across functions. 80 events = add_demographics(sqlCtx.load("/data/events", "json")) training_data = events .where(events.city == "San Francisco") .select(events.timestamp) .collect()
  • 81. 81 def add_demographics(events): u = sqlCtx.table("users") # Load Hive table events .join(u, events.user_id == u.user_id) # Join on user_id .withColumn("city", zipToCity(df.zip)) # Run udf to add city column events = add_demographics(sqlCtx.load("/data/events", "json")) training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect() Logical Plan filter join events file users table expensive only join relevent users Physical Plan join scan (events) filter scan (users) 81
  • 82. 82 def add_demographics(events): u = sqlCtx.table("users") # Load partitioned Hive table events .join(u, events.user_id == u.user_id) # Join on user_id .withColumn("city", zipToCity(df.zip)) # Run udf to add city column Optimized Physical Plan with Predicate Pushdown and Column Pruning join optimized scan (events) optimized scan (users) events = add_demographics(sqlCtx.load("/data/events", "parquet")) training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect() Logical Plan filter join events file users table Physical Plan join scan (events) filter scan (users) 82
  • 83. Spark 1.5 – Speed / Robustness Project Tungsten – Tightly packed binary structures – Fully-accounted memory with automatic spilling – Reduced serialization costs 83 [Chart: average GC time per node (seconds) vs. data set size (relative, 1x–16x) for Default, Code Gen, Tungsten on-heap, and Tungsten off-heap]
  • 84. 100+ native functions with optimized codegen implementations – String manipulation – concat, format_string, lower, lpad – Date/Time – current_timestamp, date_format, date_add – Math – sqrt, randn – Other – monotonicallyIncreasingId, sparkPartitionId 84 Spark 1.5 – Improved Function Library from pyspark.sql.functions import * yesterday = date_sub(current_date(), 1) df2 = df.filter(df.created_at > yesterday) import org.apache.spark.sql.functions._ val yesterday = date_sub(current_date(), 1) val df2 = df.filter(df("created_at") > yesterday)
  • 85. Window Functions Before Spark 1.4: - 2 kinds of functions in Spark that could return a single value: • Built-in functions or UDFs (round) • take values from a single row as input, and they generate a single return value for every input row • Aggregate functions (sum or max) • operate on a group of rows and calculate a single return value for every group New with Spark 1.4: • Window Functions (moving avg, cumulative sum) • operate on a group of rows while still returning a single value for every input row.
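    A minimal sketch of the Spark 1.4+ window-function API; the sales DataFrame with product/date/revenue columns is hypothetical:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{avg, sum}

    // Moving average over the current row and the two preceding rows per product.
    val movingWindow = Window.partitionBy("product").orderBy("date").rowsBetween(-2, 0)

    // Running total per product (default frame: unbounded preceding to current row).
    val runningWindow = Window.partitionBy("product").orderBy("date")

    val withWindows = sales
      .withColumn("movingAvg", avg("revenue").over(movingWindow))
      .withColumn("runningTotal", sum("revenue").over(runningWindow))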
  • 86.
  • 87. Streaming DataFrames Umbrella ticket to track what's needed to make streaming DataFrame a reality: https://issues.apache.org/jira/browse/SPARK-8360

Editor's notes

  1. This saturated both disk and network layers
  2. The old Spark API (T&A) is based on Java/Python objects - this makes it hard for the engine to store data compactly (Java objects in memory carry a lot of extra overhead: class information, pointers to various things, etc.) - and it cannot understand the semantics of user functions - so if you run a map function over just one field of the data, it still has to read the entire object into memory. Spark doesn't know you only cared about one field.
  3. DataFrames were inspired by previous distributed data frame efforts, including Adatao’s DDF and Ayasdi’s BigDF. However, the main difference from these projects is that DataFrames go through the Catalyst optimizer, enabling optimized execution similar to that of Spark SQL queries.
  4. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. I'd say that a DataFrame is the result of a transformation of any other RDD. Your input RDD might contain strings and numbers. But as a result of the transformation you end up with an RDD that contains GenericRowWithSchema, which is what a DataFrame actually is. So, I'd say that a DataFrame is just a sort of wrapper around a simple RDD, which provides some additional and pretty useful stuff.
  5. To compute an average. I have a dataset that is a list of names and ages. Want to figure out the average age for a given name. So, age distribution for a name…
  6. Head is non-deterministic, could change between jobs. Just the first partition that materialized returns results.
  7. Head is non-deterministic, could change between jobs. Just the first partition that materialized returns results.
  8. Spark SQL (1.4+) is the only project that can read from multiple versions of Hive. Spark 1.5 can read from 0.12 – 1.2. A lot of the Hive functionality is useful even if you don't have a Hive installation! Spark will automatically create a local copy of the Hive metastore so you can use window functions, Hive UDFs, and create persistent tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default Spark build. The specific variant of SQL that is used to parse queries can also be selected using the spark.sql.dialect option. This parameter can be changed using either the setConf method on a SQLContext or by using a SET key=value command in SQL. For a SQLContext, the only dialect available is "sql", which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the default is "hiveql", though "sql" is also available.
  9. The following example shows how to construct DataFrames in Python. A similar API is available in Scala and Java.
  10. Once built, DataFrames provide a domain-specific language for distributed data manipulation.  Here is an example of using DataFrames to manipulate the demographic data of a large population of users:
  11. You can also incorporate SQL while working with DataFrames, using Spark SQL. This example counts the number of users in the young DataFrame.
  12. But it's not the same as if you called .cache() on an RDD[Row], since we re-encode the data into byte buffers before caching so that there is less pressure on the GC.
  13. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame The full RDD API cannot be exposed on DataFrames. Per Michael, absolute freedom for users restricts the types of optimizations that we can do.
  14. Finally, a Data Source for reading from JDBC has been added as built-in source for Spark SQL.  Using this library, Spark SQL can extract data from any existing relational databases that supports JDBC.  Examples include mysql, postgres, H2, and more.  Reading data from one of these systems is as simple as creating a virtual table that points to the external table.  Data from this table can then be easily read in and joined with any of the other sources that Spark SQL supports. This functionality is a great improvement over Spark’s earlier support for JDBC (i.e.,JdbcRDD).  Unlike the pure RDD implementation, this new DataSource supports automatically pushing down predicates, converts the data into a DataFrame that can be easily joined, and is accessible from Python, Java, and SQL in addition to Scala.
  15. Twitter and Cloudera merged efforts in 2012 to develop a columnar format Parquet is a column based storage format. It gets its name from the patterns in parquet flooring. Optimized use case for parquet is when you only need a subset of the total columns. Avro is better if you typically scan/read all of the fields in a row in each query. Typically, one of the most expensive parts of reading and writing data is (de)serialization. Parquet supports predicate push-down and schema projection to target specific columns in your data for filtering and reading — keeping the cost of deserialization to a minimum. Parquet compresses better because columns have a fixed data type (like string, integer, Boolean, etc).  it is easier to apply any encoding schemes on columnar data which could even be column specific such as delta encoding for integers and prefix/dictionary encoding for strings. Also, due to the homogeneity in data, there is a lot more redundancy and duplicates in the values in a given column. This allows better compression in comparison to data stored in row format. The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files. Ideal row group size: 512 MB – 1 GB. Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file. Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write).  Data page size: 8 KB recommended. Data pages should be considered indivisible so smaller data pages allow for more fine grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (less page headers) and potentially less parsing overhead (processing headers).  https://parquet.apache.org/documentation/latest/
  16. Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset. Column chunk: A chunk of the data for a particular column. These live in a particular row group and is guaranteed to be contiguous in the file. Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which is interleaved in a column chunk. Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages. Metadata is written after the data to allow for single pass writing. Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially. There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol. The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files. Ideal row group size: 512 MB – 1 GB. Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file. Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write).  Data page size: 8 KB recommended. Data pages should be considered indivisible so smaller data pages allow for more fine grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (less page headers) and potentially less parsing overhead (processing headers).  https://parquet.apache.org/documentation/latest/
  17. First, organizations that store lots of data in parquet often find themselves evolving the schema over time by adding or removing columns. With this release we add a new feature that will scan the metadata for all files, merging the schemas to come up with a unified representation of the data. This functionality allows developers to read data where the schema has changed over time, without the need to perform expensive manual conversions. - In Spark 1.4, we plan to provide an interface that will allow other formats, such as ORC, JSON and CSV, to take advantage of this partitioning functionality.
  18. On the builder you can specify methods… like, do you want to overwrite data already there? load, save, and saveAsTable are actions.
  19. Note that by default in Spark SQL, there is a parameter called spark.sql.shuffle.partitions, which sets the # of partitions in a DataFrame after a shuffle (in case the user didn't manually specify it). Currently, Spark does not do any automatic determination of partitions, it just uses the # in that parameter. Doing more automatic determination of partitions is on our roadmap though. You can change this parameter using: sqlContext.setConf(key, value). 1.6 = adaptive shuffle: look at the output of the map side, then pick the # of reducers. Matei and Yin's hack day project.
  20. Case classes are used when creating classes that primarily hold data. When your class is basically a data-holder, case classes simplify your code and perform common work. With case classes, unlike regular classes, we don't have to use the new keyword when creating an object. The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection (seen in green) and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit, you can use custom classes that implement the Product interface. - - - - What is a case class (vs a normal class)? Their original purpose was pattern matching, but they're used for more now. A case class is Scala's version of a Java bean (Java has classes primarily for data (getters and setters) and classes mostly for operations); case classes are mostly for data. Scala can do reflection to establish/infer the schema of the df (seen in green). You'd want to be more robust about parsing CSV in real life. peopleRDD.toDF uses (a) Scala implicits and (b) the type of the RDD (RDD[Person]) to infer the schema. Mention that a case class, in Scala, is basically a Scala bean: a container for data, augmented with useful things by the Scala compiler.
  21. Catalyst is a powerful new optimization framework. The Catalyst framework allows the developers behind Spark SQL to rapidly add new optimizations, enabling us to build a faster system more quickly.  Unlike the eagerly evaluated data frames in R and Python, DataFrames in Spark have their execution automatically optimized by a query optimizer called Catalyst. Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution.
  22. Unlike the eagerly evaluated data frames in R and Python, DataFrames in Spark have their execution automatically optimized by a query optimizer called Catalyst. Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution.
  23. At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic. Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames. Since the optimizer generates JVM bytecode for execution, Python users will experience the same high performance as Scala and Java users.
  24. Since the optimizer generates JVM bytecode for execution, Python users will experience the same high performance as Scala and Java users. The above chart compares the runtime performance of running group-by-aggregation on 10 million integer pairs on a single machine (source code). Since both Scala and Python DataFrame operations are compiled into JVM bytecode for execution, there is little difference between the two languages, and both outperform the vanilla Python RDD variant by a factor of 5 and Scala RDD variant by a factor of 2.
  25. At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them.  A tree is just a Scala object. - - These classes can be used to build up trees; for example, the tree for the expression x+(1+2), would be represented in Scala code as follows: (See Scala code and Diagram)
  26. Pattern matching is a feature of many functional languages that allows extracting values from potentially nested structures of algebraic data types. - The case keyword here is Scala’s standard pattern matching syntax, and can be used to match on the type of an object as well as give names to extracted values (c1 and c2 here). - The pattern matching expression that is passed to transform is a partial function, meaning that it only needs to match to a subset of all possible input trees.  This ability means that rules only need to reason about the trees where a given optimization applies and not those that do not match. Thus, rules do not need to be modified as new types of operators are added to the system.
  27. Rules (and Scala pattern matching in general) can match… -  In the example above, repeated application would constant-fold larger trees, such as (x+0)+(3+3)
  28. Examples below show why you may want to run a rule multiple times Running rules to fixed point means that each rule can be simple and self-contained, and yet still eventually have larger global effects on a tree.  - In our experience, functional transformations on immutable trees make the whole optimizer very easy to reason about and debug. They also enable parallelization in the optimizer, although we do not yet exploit this.
  29.  In the physical planning phase, Catalyst may generate multiple plans and compare them based on cost. All other phases are purely rule-based. Each phase uses different types of tree nodes; Catalyst includes libraries of nodes for expressions, data types, and logical and physical operators. - Does Catalyst currently have the capability to generate multiple physical plans? You had mentioned at TtT last week that costing is done eagerly to prune branches that are not allowed (aka greedy algorithm).  - Greedy Algorithm works by making the decision that seems most promising at any moment; it never reconsiders this decision, whatever situation may arise later. A greedy algorithm is an algorithm that follows the problem solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum.
  30. Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by a SQL parser, or from a DataFrame object constructed using the API. A syntax tree is a tree representation of the structure of the source code. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in not representing every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches. - An abstract syntax tree for the following code for the Euclidean algorithm: while b ≠ 0: if a > b then a := a − b, else b := b − a; return a
  31. Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by a SQL parser, or from a DataFrame object constructed using the API. A syntax tree is a tree representation of the structure of the source code. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in not representing every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches. - An abstract syntax tree for the following code for the Euclidean algorithm: while b ≠ 0: if a > b then a := a − b, else b := b − a; return a
  32. A trivial [[Analyzer]] with an [[EmptyCatalog]] and [[EmptyFunctionRegistry]]. Used for testing * when all relations are already filled in and the analyser needs only to resolve attribute * references. * Provides a logical query plan analyzer, which translates [[UnresolvedAttribute]]s and * [[UnresolvedRelation]]s into fully typed objects using information in a schema [[Catalog]] and * a [[FunctionRegistry]].
  33. Note, this is not cost based. Cost-based optimization is performed by generating multiple plans using rules, and then computing their costs.
  34. A trivial [[Analyzer]] with an [[EmptyCatalog]] and [[EmptyFunctionRegistry]]. Used for testing * when all relations are already filled in and the analyser needs only to resolve attribute * references. * Provides a logical query plan analyzer, which translates [[UnresolvedAttribute]]s and * [[UnresolvedRelation]]s into fully typed objects using information in a schema [[Catalog]] and * a [[FunctionRegistry]].
  35. The framework supports broader use of cost-based optimization, however, as costs can be estimated recursively for a whole tree using a rule. We thus intend to implement richer cost-based optimization in the future.
  36. A trivial [[Analyzer]] with an [[EmptyCatalog]] and [[EmptyFunctionRegistry]]. Used for testing * when all relations are already filled in and the analyser needs only to resolve attribute * references. * Provides a logical query plan analyzer, which translates [[UnresolvedAttribute]]s and * [[UnresolvedRelation]]s into fully typed objects using information in a schema [[Catalog]] and * a [[FunctionRegistry]].
  37. As a simple example, consider the Add, Attribute and Literal tree nodes introduced in Section 4.2, which allowed us to write expressions such as (x+y)+1. Without code generation, such expressions would have to be interpreted for each row of data, by walking down a tree of Add, Attribute and Literal nodes. This introduces large amounts of branches and virtual function calls that slow down execution. With code generation, we can write a function to translate a specific expression tree to a Scala AST as follows: The strings beginning with q are quasiquotes, meaning that although they look like strings, they are parsed by the Scala compiler at compile time and represent ASTs for the code within. Quasiquotes can have variables or other ASTs spliced into them, indicated using $ notation. For example, Literal(1) would become the Scala AST for 1, while Attribute("x") becomes row.get("x"). In the end, a tree like Add(Literal(1), Attribute("x")) becomes an AST for a Scala expression like 1+row.get("x").
  38. A trivial [[Analyzer]] with an [[EmptyCatalog]] and [[EmptyFunctionRegistry]]. Used for testing * when all relations are already filled in and the analyser needs only to resolve attribute * references. * Provides a logical query plan analyzer, which translates [[UnresolvedAttribute]]s and * [[UnresolvedRelation]]s into fully typed objects using information in a schema [[Catalog]] and * a [[FunctionRegistry]].
  39. Sometimes you want to call complex functions to do additional work inside of the SQL queries. UDFs can be inlined in the DataFrame code. The UDF zipToCity just invokes a lambda function that takes a zipCode and does some custom logic to figure out which city the zip code is located in. I have a function called add_demographics, which takes a DataFrame with a user ID and will automatically compute a bunch of demographic information. So, we do a join based on user_id and then add a new column with .withColumn… the UDF result becomes the new column. This def returns a new DataFrame.
  40. All of this is lazy, so Spark SQL can do optimizations much later. For this type of machine learning I'm doing, I may only need the ts column from San Francisco. Note that add_demographics does not have extra functionality to filter down to just SF and ts.
  41. So maybe add_demographics was written by my co-worker and I just want to use it. So we construct a logical query plan. Since this planning is happening at the logical level, optimizations can even occur across function calls, as shown in the example below. In this example, Spark SQL is able to push the filtering of users by their location through the join, greatly reducing its cost to execute.  This optimization is possible even though the original author of the add_demographics function did not provide a parameter for specifying how to filter users! - Ideally we want to filter the users ahead of time based on the extra predicates, and only do the join on the relevant users.
  42. Even cooler, if I want to optimize this later on… So, here I changed to a partitioned Hive table (users) and also used Parquet instead of JSON (events). Now with Parquet, Spark SQL notices there are new optimizations that it can now do.
  43. The idea of Project Tungsten is to reimagine the execution engine for Spark SQL. As a user, when you move to 1.5 you will see significant robustness and speed improvements.
  44. Before, you had to resort to Hive UDFs or drop into SQL… But now there are 100+ native functions for which Java bytecode is generated at runtime to evaluate whatever you need.
  45. Pass a physical plan generated by Catalyst into Streaming