Interactive Programs Debugging and Development in Apache Spark
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Vega
‣ Conclusions
Big Data Debugging
๏ Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is difficult
๏ Analysis tools are still in their “infancy”
๏ Today’s large-scale jobs are black boxes:
• Job submitted to a cluster
• Results come back minutes to hours later
• No visibility into the running algorithm
Big Data Debugging - State of the Art
Big Data Debugging - Desiderata
๏ Easy-to-use GDB-like debugger [ICSE 16] (not covered in this talk)
๏ Visibility into the data of a running workflow
• E.g., what (input) data led to this (outlier) result?
๏ Selective replay of a portion of the data processing steps on the subsets of intermediate data leading to outlier results
๏ Interactive program analysis
Big Data Debugging - Challenges
๏ Visibility of data -> tracking the dependencies between individual input and output records
๏ Selective replay -> storage of intermediate results:
• Dataset shared between the running job and the analysis tool
๏ Interactivity -> implementation constraints:
• Latency constraint - in-memory computation
• Programming interface constraint - integration with the Spark DSL
Data Provenance (Lineage)
๏ Well-known technique in databases
๏ Two granularities of provenance:
• Transformation (coarse-grained) provenance
– Records the complete workflow of the derivation of a dataset
– Spark RDD lineage is an example of this form of provenance
• Data (fine-grained) provenance
– Records data dependencies between input and output records
– The type of provenance Titian focuses on
Data Provenance - Example

SELECT AVG(temp), time
FROM sensors
GROUP BY time

Sensors
Tuple-ID  Time  Sensor-ID  Temperature
T1        11AM  1          34
T2        11AM  2          35
T3        11AM  3          35
T4        12PM  1          35
T5        12PM  2          35
T6        12PM  3          100
T7        1PM   1          35
T8        1PM   2          35
T9        1PM   3          80

Result-ID  Time  AVG(temp)
ID-1       11AM  34.6
ID-2       12PM  56.6   <- outlier
ID-3       1PM   50     <- outlier

Why do ID-2 and ID-3 have those high averages?
Previous Data Provenance DISC Systems
๏ They use external storage systems (HDFS in RAMP [CIDR-11], DBMS in Newt [SOCC-13]) to retain lineage data -> high overhead
๏ Data provenance queries are supported in a separate programming interface -> low interactivity
Experience with Newt and RAMP
๏ Word Count job
๏ RAMP is up to 4X slower than Spark
๏ Newt is up to 86X slower
[Chart: job time (s) vs. dataset size (GB) for Spark, Newt, and RAMP]
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Vega
‣ Conclusions
Example: Log Analysis

Loads error messages from a log, counts the number of occurrences of each error, and returns a report containing the description of each error:

lc = new LineageContext(sc)
lines = lc.textFile("hdfs://...")
errors = lines.filter(_.startsWith("error"))
codes = errors.map(_.split("\t")(1))
pairs = codes.map(word => (word, 1))
counts = pairs.reduceByKey(_ + _)
reports = counts.map(kv => (dscr(kv._1), kv._2))
reports.collect.foreach(println)
Example: Backward Tracing

Given the result of the previous example, select the most frequent error and trace back to the input lines containing it:

frequentPair = reports.sortBy(_._2, false).take(1)
frequent = reports.filter(_ == frequentPair(0))
lineage = frequent.getLineage()
input = lineage.goBackAll()
input.collect().foreach(println)
Example: Forward Tracing

Return the error codes generated from the network sub-system (indicated in the log by a "NETWORK" tag):

network = errors.filter(_.contains("NETWORK"))
lineage = network.getLineage()
output = lineage.goNextAll()
output.collect().foreach(println)
Example: Selective Replay

Return the error distribution without the ones caused by the Guest user:

lineage = reports.getLineage()
inputLines = lineage.goBackAll()
noGuest = inputLines.filter(line => !line.contains("Guest") && line.startsWith("error"))
newCodes = noGuest.map(_.split("\t")(1))
newPairs = newCodes.map(word => (word, 1))
newCounts = newPairs.reduceByKey(_ + _)
newRep = newCounts.map(kv => (dscr(kv._1), kv._2))
newRep.collect
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Vega
‣ Conclusions
Provenance Capturing
๏ LineageContext wraps SparkContext
• Provides visibility into the submitted job
๏ LineageRDDs are instrumented at stage boundaries
• They wrap native RDDs
• A specific LineageRDD implementation is chosen based on the instrumented transformation
๏ Provenance data is buffered inside LineageRDDs
• Saved into the Spark BlockManager for querying
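As a minimal sketch, a LineageContext-style wrapper could look like the following (the class names follow the talk, but the bodies are illustrative assumptions, not Titian's actual implementation):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative only: pair each record with a synthetic lineage ID so that
// (input ID, output ID) associations can be buffered per task and later
// saved into the BlockManager.
class LineageRDD[T](val rdd: RDD[(Long, T)]) extends Serializable {
  def map[U](f: T => U): LineageRDD[U] =
    // For a narrow transformation the association is simply id -> id.
    new LineageRDD(rdd.map { case (id, v) => (id, f(v)) })
}

class LineageContext(sc: SparkContext) {
  def textFile(path: String): LineageRDD[String] =
    // Use each record's position in the file as its lineage ID.
    new LineageRDD(sc.textFile(path).zipWithIndex().map(_.swap))
}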
Spark Stage DAG

lines = sc.textFile("hdfs://...")
errors = lines.filter(_.startsWith("error"))
codes = errors.map(_.split("\t")(1))
pairs = codes.map(word => (word, 1))
counts = pairs.reduceByKey(_ + _)
reports = counts.map(kv => (dscr(kv._1), kv._2))

[DAG: lines -> errors -> codes -> pairs (Stage 1) | shuffle | counts -> reports (Stage 2)]
Instrumented Workflow

[DAG: Hadoop LineageRDD -> lines -> errors -> codes -> pairs -> Combiner LineageRDD (Stage 1) | shuffle | Reducer LineageRDD -> counts -> reports -> Stage LineageRDD (Stage 2)]

Each LineageRDD captures an (Input ID, Output ID) table at its position in the workflow; the captured tables are shown below.
Lineage Capture Runtime Overheads
๏ Same Word Count job
๏ Titian is on average 1.3X slower than Spark
[Chart: job time (s) vs. dataset size (GB) for Spark, Titian, Newt, and RAMP]
Example: Captured Data Lineage

Hadoop
Input ID  Output ID
offset1   id1
offset2   id2
offset3   id3

Combiner
Input ID    Output ID
{id1, id3}  400
{id2}       4

Reducer
Input ID  Output ID
[p1, p2]  400
[p1]      4

Stage
Input ID  Output ID
400       id1
4         id2
Example: Trace Back

Tracing a result record backward walks the lineage tables from right (Stage) to left (Hadoop) as a chain of joins:
1. Stage.Input ID ⋈ Reducer.Output ID - starting from output record id1, the Stage table gives input ID 400, and joining it against the Reducer table yields the row ([p1, p2], 400).
2. Now let's do it for real: in the distributed setting the Hadoop and Combiner tables live on Worker1 and Worker2, while the Reducer and Stage tables live on Worker3. The Reducer row names the partitions (p1, p2) that hold the contributing combiner outputs, so a targeted shuffle forwards the trace (p1 -> 400) only to the workers holding those partitions.
3. Reducer.Output ID ⋈ Combiner.Output ID - on each of those workers, joining on key 400 recovers the combiner input sets: {id1, id3} on Worker1 and {id1, ...} on Worker2.
4. Combiner.Input ID ⋈ Hadoop.Output ID - finally, joining those input IDs against the Hadoop tables recovers the original input offsets (offset1, offset3, ...).
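As a hedged illustration, the backward trace over the tables above can be written as plain RDD joins (the table schemas and names are assumptions for this sketch, and the targeted shuffle that restricts the join to partitions p1 and p2 is elided):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// stage:    (input key, output record ID)    e.g. (400, "id1")
// combiner: (input ID set, output key)       e.g. (Set("id1", "id3"), 400)
// hadoop:   (file offset, output record ID)  e.g. (1234L, "id1")
def traceBack(stage: RDD[(Int, String)],
              combiner: RDD[(Set[String], Int)],
              hadoop: RDD[(Long, String)],
              target: String): RDD[Long] = {
  // 1. Stage.Input ID: keys of the rows whose output is the traced record
  val keys = stage.filter(_._2 == target).map(row => (row._1, ()))
  // 2-3. Combiner.Output ID joined with the traced keys
  val inputIds = combiner.map(_.swap).join(keys)
    .flatMap { case (_, (ids, _)) => ids }
  // 4. Combiner.Input ID joined with Hadoop.Output ID -> input offsets
  hadoop.map(_.swap).join(inputIds.map(id => (id, ())))
    .map { case (_, (offset, _)) => offset }
}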
Tracing Performance
๏ Word Count job
๏ Tracing one record backward takes < 1 sec for datasets up to 100GB
๏ 18 sec for a 500GB dataset
Vega: Optimizations for Selective Replay
Matteo Interlandi, Sai Deep Tetali, Muhammad Ali Gulzar, Joseph Noor, Miryung Kim, Todd Millstein, Tyson Condie
Under Submission
Debugging workflow
๏ Run program
๏ Understand the cause of bugs / outliers:
• Lineage - Titian [VLDB 2016]
• Breakpoints/watchpoints - BigDebug [ICSE 2016]
• Crash culprit - BigDebug [ICSE 2016]
๏ Fix bug
• Fast selective replay
First Strategy
Convert changes in code to changes in data
Incremental Plan

input .map(x => (x, 1)) .reduceByKey(_ + _)

[DAG: lines -> pairs (Stage 1) | shuffle | counts (Stage 2)]

Input  Map      Shuffle       Reduce
aa     (aa, 1)  (aa, [1, 1])  (aa, 2)
b      (b, 1)   (b, 1)        (b, 1)
c      (c, 1)   (c, [1, 1])   (c, 2)
aa     (aa, 1)
c      (c, 1)
Incremental Plan

Inject a filter in the workflow:

input .filter(x => x != "c") .map(x => (x, 1)) .reduceByKey(_ + _)

[DAG: lines -> filter -> pairs (Stage 1) | shuffle | counts (Stage 2)]

Input  Filter  Map      Shuffle       Reduce
aa     aa      (aa, 1)  (aa, [1, 1])  (aa, 2)
b      b       (b, 1)   (b, 1)        (b, 1)
c      aa      (aa, 1)
aa
c
Incremental Plan

Instead of rerunning from scratch, keep the original run's intermediate results and propagate only the deltas (deletions, written −x) through each operator:

δFilter: −c, −c
∆Map: −(c, 1), −(c, 1)
∆Shuffle: (c, [−1, −1])
∆Reduce: −(c, 2)

Applying −(c, 2) to the original output (aa, 2), (b, 1), (c, 2) yields the new result (aa, 2), (b, 1).
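A minimal sketch of this delta propagation for word count, assuming deltas are modeled as weighted records and an existing SparkContext sc (illustrative, not Vega's implementation):

import org.apache.spark.SparkContext._

val oldCounts = sc.parallelize(Seq(("aa", 2), ("b", 1), ("c", 2))) // cached original result
val dFilter   = sc.parallelize(Seq("c", "c"))   // δFilter: records removed by the new filter
val dMap      = dFilter.map(x => (x, -1))       // ∆Map: -(c, 1), -(c, 1)
val dReduce   = dMap.reduceByKey(_ + _)         // ∆Shuffle/∆Reduce: (c, -2)
// Merge the delta into the cached result and drop keys whose count vanishes.
val newCounts = oldCounts.fullOuterJoin(dReduce)
  .mapValues { case (o, d) => o.getOrElse(0) + d.getOrElse(0) }
  .filter(_._2 > 0)                             // (aa, 2), (b, 1)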
Performance
[Chart: replay time (s) vs. input data size (GB)]
Incremental replay is about 10X faster than a full rerun
Performance
๏ Good up to a certain point
๏ Two factors dominate:
• Space utilization
• Time to shuffle deltas
๏ Insight:
• The more downstream the filter is placed, the better the incremental
performance
• Especially beneficial if we can place it past the shuffle
Second Strategy
Push code changes downstream
Commutative Rewrite

The new filter(x => x != "c") is injected at the input of the already-executed plan:

[Plan: Input -> Filter -> Map -> Shuffle -> Reduce]

But the filter was written against the raw words, while everything after the map carries (word, 1) pairs: we cannot use the filter downstream as-is. Observe that the map is invertible, so we can reuse the old filter by composing it with the inverse of the map:

filter'((x, o) => x != "c")   // the rewritten filter, applied to (word, count) pairs

Since the Shuffle and Reduce operations preserve keys, Filter' commutes with them and can be applied directly to the original output:

(aa, 2), (b, 1), (c, 2) -> Filter' -> (aa, 2), (b, 1)
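A hedged sketch of the rewrite on the cached word-count output, assuming an RDD[String] named input (illustrative; the point is only that Filter' runs on the final result instead of the input):

// Original plan, run once and cached.
val counts = input.map(x => (x, 1)).reduceByKey(_ + _).cache()
// Filter': the input predicate x != "c" pushed through the inverse of
// map(x => (x, 1)), i.e. applied to the key of each output pair.
val replayed = counts.filter { case (x, _) => x != "c" }  // (aa, 2), (b, 1)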
Performance
[Chart: replay time vs. input data size (GB)]
The commutative rewrite is about 1000X faster than a full rerun
Why does it scale so well?
๏ Runtime is on the order of the output size
๏ The output depends on the number of unique words
๏ Unique words << total words
Combining Strategies
๏ Push the changed transform past as many shuffles
as possible with rewrites
• The new transform can be placed only after materialization
points
• By default we materialize shuffle output
• Efficient because Spark already save shuffle output for fault
tolerance
๏ Use delta computation for the remaining workflow
Vega
๏ Built on Spark and Spark SQL (only filter rewrite)
๏ Spark SQL API is unchanged
๏ Spark API includes:
• Functions with inverses (for maps)
• Inverse values (for incremental reduce)
๏ Automatically rewrites workflows using commutativity
and incremental evaluation
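For illustration, an API that carries an inverse alongside a map function might look like this sketch (an assumption about the shape of the idea, not Vega's actual API):

// A map function bundled with its inverse, so the rewriter can translate
// predicates on the map's input into predicates on its output.
case class InvertibleFn[A, B](f: A => B, inverse: B => A)

val wordToPair = InvertibleFn[String, (String, Int)](
  w => (w, 1),             // forward: the word-count map
  { case (w, _) => w }     // inverse: recover the word from the pair
)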
Conclusions
๏ Titian provides Spark users the ability to trace through program execution
๏ Features:
• Intermediate results are shared in memory
• Tight integration with the Spark API (LineageRDD)
• Low job overhead
• Efficient lineage queries
๏ Vega provides 1-3 orders of magnitude performance gains over rerunning the computation from scratch
๏ Both provide results in a few seconds for many workflows, allowing interactive usage
Thank you
Outline
‣ Motivating Scenario
‣ Titian Programming Interface
‣ Internals
‣ Performance
‣ Conclusions
Configuration
๏ Two sets of experiments:
• Unstructured - grep and word count
• Structured - PigMix queries
๏ Datasets:
• Unstructured: 500MB to 500GB files containing words generated using a Zipf distribution over a dictionary of 8000 words
• Structured: we used the PigMix generator to create datasets of sizes ranging from 1GB to 1TB
๏ Configuration:
• 16 machines, each with 4 cores (2 hyper-threads per core), 32GB of RAM, and a 1TB disk
• Spark 1.2.1
Lineage Capture Runtime Overheads
[Chart: capture overhead results, as presented earlier]
Tracing Performance
[Chart: tracing times, as presented earlier]
Titian: Data Provenance in Spark
๏ Titian provides Spark users the ability to trace through program execution at interactive speed
๏ Features:
• Intermediate results are shared in memory
• Tight integration with the Spark API (LineageRDD)
• Low job overhead
• Efficient lineage queries
๏ We believe Titian will open the door to program logic debugging, iterative data (and program) cleaning, and exploratory analysis
Capturing: HadoopLineageRDD

The Hadoop LineageRDD wraps lines. For each record it reads the record's input ID (its file offset), assigns and propagates a fresh output ID through the TaskContext, and saves the (Input ID, Output ID) pair:
• (offset1, "error 400 ...") enters; input ID offset1 is read; the record is emitted with output ID id1; (offset1, id1) is saved.
• (offset2, "error 4 ...") likewise yields (offset2, id2).
• (offset3, "error 400 ...") yields (offset3, id3).

Input ID  Output ID
offset1   id1
offset2   id2
offset3   id3
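A compact sketch of that capture loop (assumed structure; Titian stores the buffered pairs in the Spark BlockManager rather than a local buffer):

import scala.collection.mutable.ArrayBuffer

// Wrap the task's record iterator: read the input ID (file offset),
// assign a fresh output ID, buffer the (input ID, output ID) pair, and
// pass the record through unchanged.
def capture[T](records: Iterator[(Long, T)],
               lineage: ArrayBuffer[(Long, Long)]): Iterator[T] = {
  var nextId = 0L
  records.map { case (offset, record) =>
    nextId += 1                    // fresh output ID (id1, id2, ...)
    lineage += ((offset, nextId))  // save (offset1, id1), (offset2, id2), ...
    record
  }
}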
Combiner Build Phase

The Combiner LineageRDD sits at the end of Stage 1, between pairs and the shuffle. Next to the combiner's usual Key -> Agg Value table it maintains a Key -> Input IDs table, filled from the current input ID carried in the TaskContext:
• Record id1 ("error 400 ...") produces (400, 1): the aggregate table gets 400 -> 1 and the lineage table gets 400 -> { id1 }.
• Record id2 ("error 4 ...") produces (4, 1): 4 -> 1 and 4 -> { id2 }.
• Record id3 ("error 400 ...") produces (400, 1) again: the aggregate grows to 400 -> 2 and the lineage set to 400 -> { id1, id3 }.

Key  Agg Value        Key  Input IDs
400  2                400  { id1, id3 }
4    1                4    { id2 }
Combiner Probe Phase

When the combiner emits its output records, the probe phase looks each key up in the Key -> Input IDs table and saves the corresponding lineage row:
• Emitting (400, 2) probes key 400 and saves ({ id1, id3 }, 400).
• Emitting (4, 1) probes key 4 and saves ({ id2 }, 4).

Input ID      Output ID
{ id1, id3 }  400
{ id2 }       4
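In sketch form, the build and probe phases could be expressed as follows (illustrative data structures, not Titian's code):

import scala.collection.mutable

// Build phase: alongside key -> count, track key -> contributing input IDs.
val agg     = mutable.Map.empty[Int, Int]
val lineage = mutable.Map.empty[Int, Set[Long]]

def add(key: Int, inputId: Long): Unit = {
  agg(key)     = agg.getOrElse(key, 0) + 1
  lineage(key) = lineage.getOrElse(key, Set.empty[Long]) + inputId
}

// Probe phase: when emitting (key, count), also record ({input IDs}, key).
def probe(): Iterator[((Set[Long], Int), (Int, Int))] =
  agg.iterator.map { case (k, v) => ((lineage(k), k), (k, v)) }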
Instrumented Workflow

The combiner outputs travel through the shuffle tagged with the ID of the partition that produced them: (400, 2) and (4, 1) from partition p1 become (400, (2, p1)) and (4, (1, p1)), and another worker contributes (400, (5, p2)). On the reduce side, the Reducer LineageRDD collects the partition tags per key (the current key, e.g. 400, is carried in the TaskContext) and records them:

Input ID  Output ID
[p1, p2]  400
[p1]      4
Capturing: StageLineageRDD

The Stage LineageRDD wraps reports at the end of Stage 2. For each output record it reads the input ID (the record's key) from the TaskContext, assigns the final output ID, and saves the pair:
• (Bad request, 7) arrives under key 400 and gets output ID id1; (400, id1) is saved.
• (Failure, 1) arrives under key 4 and gets output ID id2; (4, id2) is saved.

Input ID  Output ID
400       id1
4         id2