SlideShare ist ein Scribd-Unternehmen logo
1 von 47
Downloaden Sie, um offline zu lesen
Sparser: Fast Analytics on Raw Data
by Avoiding Parsing
Shoumik Palkar, Firas Abuzaid, Peter Bailis, Matei Zaharia
Motivation
Bigger unstructured datasets and faster hardware.
Time
Speed
Time
DataVolume
Parsing unstructured data before querying it
is often very slow.
Today: Spark Data Sources API
Spark Core Engine
Push part of query into the data source.
Today: Spark Data Sources API
+ E.g., column pruning directly in Parquet data loader
- Little support for unstructured formats (e.g., can’t avoid JSON parsing)
Spark Core Engine
Data Source API
Push part of query into the data source.
Parsing: A Computational Bottleneck
Example: Existing state-of-the-
art JSON parsers 100x slower
than scanning a string!
113836
56796
360
0
50000
100000
150000
RapidJSON
Parsing
Mison Index
Contruction
String Scan
Cycles/5KBRecord
State-of-the-art Parsers
*Similar results on binary formats
like Avro and Parquet
Parsing seems to be a necessary evil:
how do we get around doing it?
Key Opportunity: High Selectivity
High selectivity especially true
for exploratory analytics.
0
0.2
0.4
0.6
0.8
1
1.E-09 1.E-05 1.E-01
CDF
Selectivity
Databricks Censys
40% of customer Spark queries at Databricks select < 20% of data
99% of queries in Censys select < 0.001% of data
Data Source API provides access to
query filters at data source!
How can we exploit high selectivity to accelerate parsing?
Sparser: Filter Before You Parse
Raw DataFilter
Raw DataFilter
Raw DataFilter
Raw DataParse
Raw DataParse
Today:
parse full input à slow!
Sparser: Filter before parsing first
using fast filtering functions with false
positives, but no false negatives
Demo
Count Tweets where text contains “Trump” and “Putin”
Sparser in Spark SQL
Spark Core Engine
Data Source API
val df = spark.read.format(“json”)
Sparser in Spark SQL
Spark Core Engine
Data Source API
Sparser Filtering Engine
val df = spark.read.format(“edu.stanford.sparser.json”) Sparser	Data	Source	Reader
(Also	supports	Avro,	Parquet!)
Sparser Overview
Sparser Overview
Raw Filter (“RF”): filtering function on a bytestream with false
positives but no false negatives.
Use an optimizer to combine RFs into a cascade.
But is it in the “text” field? Is the
full word “Trump”?
0x54 0x72 0x75 0x6d 0x70
T r u m b
RF
“TRUM”
Pass!
Other
RFs
Example: Tweets WHERE
text contains “Trump”
Full
Parser
Key Challenges in Using RFs
1. How do we implement the RFs efficiently?
2. How do we efficiently choose the RFs to maximize parsing
throughput for a given query and dataset?
Rest of this Talk
• Sparser’s API
• Sparser’s Raw Filter Designs (Challenge 1)
• Sparser’s Optimizer: Choosing a Cascade (Challenge 2)
• Performance Results on various file formats (e.g., JSON, Avro)
Sparser API
Sparser API
Filter Example
Exact String Match
WHERE user.name = “Trump”
WHERE likes = 5000
Sparser API
Filter Example
Exact String Match
WHERE user.name = “Trump”
WHERE likes = 5000
Contains String WHERE text contains “Trum”
Sparser API
Filter Example
Exact String Match
WHERE user.name = “Trump”
WHERE likes = 5000
Contains String WHERE text contains “Trum”
Contains Key WHERE user.url != NULL
Sparser API
Filter Example
Exact String Match
WHERE user.name = “Trump”
WHERE likes = 5000
Contains String WHERE text contains “Trum”
Contains Key WHERE user.url != NULL
Conjunctions
WHERE user.name = “Trump”
AND user.verified = true
Disjunctions
WHERE user.name = “Trump”
OR user.name = “Obama”
Sparser API
Filter Example
Exact String Match
WHERE user.name = “Trump”
WHERE likes = 5000
Contains String WHERE text contains “Trum”
Contains Key WHERE user.url != NULL
Conjunctions
WHERE user.name = “Trump”
AND user.verified = true
Disjunctions
WHERE user.name = “Trump”
OR user.name = “Obama”
Currently does not support numerical range-based predicates.
Challenge 1: Efficient RFs
Raw Filters in Sparser
SIMD-based filtering functions that pass or discard a record
by inspecting raw bytestream
113836
56796
360
0
50000
100000
150000
RapidJSON Parsing Mison Index Contruction Raw Filter
Cycles/5KBRecord
Example RF: Substring Search
Search for a small (e.g., 2, 4, or 8-byte) substring of a query
predicate in parallel using SIMD
On modern CPUs: compare 32 characters in parallel in ~4 cycles (2ns).
Length 4 Substring
packed in SIMD register
Input : “ t e x t ” : “ I j u s t m e t M r . T r u m b ! ! ! ”
Shift 1 : T r u m T r u m T r u m T r u m T r u m T r u m T r u m - - - - - - -
Shift 2 : - T r u m T r u m T r u m T r u m T r u m T r u m T r u m - - - - - -
Shift 3 : - - T r u m T r u m T r u m T r u m T r u m T r u m T r u m - - - - -
Shift 4 : - - - T r u m T r u m T r u m T r u m T r u m T r u m T r u m - - - -
Example query: text contains “Trump”
False positives (found “Trumb” by accident),
but no false negatives (No “Trum” ⇒ No “Trump”)
Other	RFs	also	possible!	Sparser	selects	them	agnostic	of	implementation.
Key-Value Search RF
Searches for key, and if key is found, searches for value until
some stopping point. Searches occur with SIMD.
Only applicable for exact matches
Useful for queries with common substrings (e.g., favorited=true)
Key: name Value: Trump Delimiter: ,
Ex 1: “name”: “Trump”,
Ex 2: “name”: “Actually, Trump”
Ex 3: “name”: “My name is Trump”
(Pass)
(Fail)
(Pass, False positive)
Second Example would result in false negative if we allow substring matches
Other	RFs	also	possible!	Sparser	selects	them	agnostic	of	implementation.
Challenge 2: Choosing RFs
Choosing RFs
To decrease false positive rate, combine RFs into a cascade
Sparser uses an optimizer to choose a cascade
Raw DataFilter
Raw DataFilter
Raw DataFilter
Raw
DataParse
Raw DataFilter
Raw DataParse
Raw DataFilter
Raw
DataParse
Raw DataFilter
Raw DataFilter
vs. vs.
Sparser’s Optimizer
Step 2: Measure
Params on Sample
of Records
raw bytestream
0010101000110101
RF 1
RF fails
RF 2
X
X
RF fails
Filtered bytes
sent to full parser
Step 4: Apply
Chosen Cascade
for sampled
records:
1 = passed
0 = failed
Step 3: Score and
Choose Cascade
…
(name = "Trump" AND text contains "Putin")
Step 1: Compile
Possible RFs from
predicates
RF1: "Trump"
RF2: "Trum"
RF3: "rump"
RF4: "Tr"
…
RF5: "Putin"
C (RF1) = 4
C (RF1àRF2) = 1
C (RF2àRF3) = 6
C (RF1àRF3) = 9
…
S1 S2 S3
RF 1 0 1 1
RF 2 1 0 1
Sparser’s Optimizer: Configuring RF Cascades
Three high level steps:
1. Convert a query predicate to a set of RFs
2. Measure the passthrough rate and runtime of each RF and
the runtime of the parser on sample of records
3. Minimize optimization function to find min-cost cascade
1. Converting Predicates to RFs
Running Example
(name = “Trump” AND text contains “Putin”)
OR name = “Obama”
Possible RFs*
Substring Search: Trum, Tr, ru, …, Puti, utin, Pu, ut, …, Obam, Ob, …
* Translation from predicate to RF is format-dependent:
examples are for JSON data
2. Measure RFs and Parser
Evaluate RFs and parser on a sample of records.
*Passthrough	rates	of	RFs	are	non-independent! But	too	expensive	to	
measure	rate	of	each	cascade	(combinatorial	search	space)
RF
RF
Parser
Passthrough rates*
and runtimes
2. Measure RFs and Parser
Passthrough rates of RFs are non-independent! But too
expensive to rate of each cascade (combinatorial search
space)
Solution: Store rates as bitmaps.
Pr[a] ∝ 1s in Bitmap of RF a: 0011010101
Pr[b] ∝ 1s in Bitmap of RF b: 0001101101
Pr[a,b] ∝ 1s in Bitwise-& of a, b: 0001000101
foreach sampled record i:
foreach RF j:
BitMap[i][j] = 1 if RF j passes on i
3. Choosing Best Cascade
Running Example
(name = “Trump” AND text contains “Putin”)
OR name = “Obama”
P
Trum
utinObam
PD Obam
PD
1. Valid cascade 2. Valid cascade
utin
Obam
PD
P
3. Invalid cascade (utin must
consider Obam when it fails)
Trum
utinObam
PD PD
Pass (right branch)Fail (left branch)Parse PDiscard D
Cascade cost
computed by
traversing tree from
root to leaf.
Each RF has a
runtime/passthrough
rate we measured in
previous step.
Probability and cost of
executing full parser
Probability and cost
of executing RF i
Cost of Cascade
with RFs 0,1,…,R
3. Choosing Best Cascade
𝐶" = $ Pr	[ 𝑒𝑥𝑒𝑐𝑢𝑡𝑒.]
.∈"
×𝑐. + Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒45678 ×𝑐45678
P
Trum
utinObam
PD Obam
PD
Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒96:; = 1
Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒:=.> = Pr	[Trum]
Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒BC5; = Pr	[¬Trum]	+	Pr[Trum,¬utin]
Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒45678 = Pr	[¬Trum,Obam]	+	Pr[Trum,utin]	+	Pr[Trum,¬utin,Obam]
Choose cascade with minimum cost: Sparser considers up to min(32,
# ORs) RFs and considers up to depth of 4 cascades.
Spark Integration
Sparser in Spark SQL
Spark Core Engine
Data Source API
Sparser Filtering Engine
Sparser in Spark SQL
Spark Core Engine
Data Source API
Sparser Filtering Engine
SQL Query to Candidate
Raw Filters
Native HDFS File Reader
Sparser Optimizer + Filtering
Raw Data
DataFrame
Performance Results
Evaluation Setup
Datasets
Censys - Internet port scan data used in network security
Tweets - collected using Twitter Stream API
Distributed experiments
on 20-node cluster with 4 Broadwell vCPUs/26GB of memory, locally
attached SSDs (Google Compute Engine)
Single node experiments
on Intel Xeon E5-2690 v4 with 512GB memory
Queries
Twitter queries from other academic work. Censys queries sampled randomly from top-3000
queries. PCAP and Bro queries from online tutorials/blogs.
Results: Accelerating End-to-End Spark Jobs
0
200
400
600
Disk Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
Runtime(seconds)
Spark Spark + Sparser Query Only
Censys queries on 652GB of JSON data: up to 4x speedup by using Sparser.
1.1 1.3 0.7 1.4
0
20
40
60
80
Disk Q1 Q2 Q3 Q4
Runtime
(seconds)
Spark Spark + Sparser Query Only
Twitter queries on
68GB of JSON data:
up to 9x speedup
by using Sparser.
Results: Accelerating Existing JSON Parsers
1
10
100
1 2 3 4 5 6 7 8 9
Runtime(s,log10)
Sparser + RapidJSON Sparser + Mison RapidJSON Mison
Censys queries compared against two state-of-the-art parsers
(Mison based on SIMD): Sparser accelerates them by up to 22x.
Results: Accelerating Binary Format Parsing
0
0.5
1
1.5
Q1 Q2 Q3 Q4
Runtime
(seconds)
avro-c Sparser + avro-c
0
0.2
Q1 Q2 Q3 Q4
Runtime
(seconds)
parquet-cpp Sparser + parquet-cpp
Sparser accelerates Avro (above) and Parquet (below) based queries by up to 5x and 4.3x respectively.
Results: Domain-Specific Tasks
0
1
2
3
4
0.01
0.1
1
10
100
1000
P1 P2 P3 P4
Speedupover
libpcap
Runtime(s,log10)
Sparser libpcap Tshark tcpdump
Sparser accelerates packet parsing by up to 3.5x compared to standard tools.
Results: Sparser’s Sensitivity to Selectivity
0
20
40
60
80
100
120
0 0.2 0.4 0.6 0.8 1
Runtime(seconds)
Selectivity
RapidJSON Sparser + RapidJSON
Mison Sparser + Mison
Sparser’s
performance
degrades
gracefully as the
selectivity of the
query decreases.
Results: Nuts and Bolts
1
10
100
0 200 400
Runtime(s,log10)
Cascade
Sparser picks a cascade within 5%
of the optimal one using its
sampling-based measurement and
optimizer.
Picked by
Sparser
100
1000
10000
0 100 200
ParseThroughput
(MB/s)
File Offset (100s of MB)
With Resampling No Resampling
Resampling to calibrate the cascade
can improve end-to-end parsing time
by up to 20x.
Conclusion
Sparser:
• Uses raw filters to filter before parsing
• Selects a cascade of raw filters using an efficient optimizer
• Delivers up to 22x speedups over existing parsers
• Will be open source soon!
shoumik@cs.stanford.edu
fabuzaid@cs.stanford.edu
Questions or Comments? Contact Us!

Weitere ähnliche Inhalte

Was ist angesagt?

Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitSpark Summit
 
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...Flink Forward
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)DataWorks Summit
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...Databricks
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Using Kafka to scale database replication
Using Kafka to scale database replicationUsing Kafka to scale database replication
Using Kafka to scale database replicationVenu Ryali
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesFlink Forward
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysDatabricks
 
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data EngineeringSputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data EngineeringDatabricks
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsDatabricks
 

Was ist angesagt? (20)

Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Using Kafka to scale database replication
Using Kafka to scale database replicationUsing Kafka to scale database replication
Using Kafka to scale database replication
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
 
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data EngineeringSputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 

Ähnlich wie Sparser: Faster Parsing of Unstructured Data Formats in Apache Spark with Firas Abuzaid and Shoumik Palkar

RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang SpecJing Kang
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?Ruben Verborgh
 
2009 Dils Flyweb
2009 Dils Flyweb2009 Dils Flyweb
2009 Dils FlywebJun Zhao
 
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...corehard_by
 
Querying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsQuerying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsRuben Verborgh
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Chris Fregly
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra SigmodJeff Hammerbacher
 
AWS re:Invent 2016: How Toyota Racing Development Makes Racing Decisions in R...
AWS re:Invent 2016: How Toyota Racing Development Makes Racing Decisions in R...AWS re:Invent 2016: How Toyota Racing Development Makes Racing Decisions in R...
AWS re:Invent 2016: How Toyota Racing Development Makes Racing Decisions in R...Amazon Web Services
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009Ian Foster
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.Tatiana Tarasova
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationKushaal Singla
 
Win pcap filtering expression syntax
Win pcap  filtering expression syntaxWin pcap  filtering expression syntax
Win pcap filtering expression syntaxVota Ppt
 
postgres loader
postgres loaderpostgres loader
postgres loaderINRIA-OAK
 
Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Chris Fregly
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016MLconf
 
Hardware Approaches for Fast Lookup & Classification
Hardware Approaches for Fast Lookup & ClassificationHardware Approaches for Fast Lookup & Classification
Hardware Approaches for Fast Lookup & ClassificationJignesh Patel
 

Ähnlich wie Sparser: Faster Parsing of Unstructured Data Formats in Apache Spark with Firas Abuzaid and Shoumik Palkar (20)

Serial-War
Serial-WarSerial-War
Serial-War
 
RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang Spec
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?
 
2009 Dils Flyweb
2009 Dils Flyweb2009 Dils Flyweb
2009 Dils Flyweb
 
Phpconf2008 Sphinx En
Phpconf2008 Sphinx EnPhpconf2008 Sphinx En
Phpconf2008 Sphinx En
 
Avro intro
Avro introAvro intro
Avro intro
 
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
C++ CoreHard Autumn 2018. Text Formatting For a Future Range-Based Standard L...
 
Querying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsQuerying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern Fragments
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
 
AWS re:Invent 2016: How Toyota Racing Development Makes Racing Decisions in R...
AWS re:Invent 2016: How Toyota Racing Development Makes Racing Decisions in R...AWS re:Invent 2016: How Toyota Racing Development Makes Racing Decisions in R...
AWS re:Invent 2016: How Toyota Racing Development Makes Racing Decisions in R...
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
Win pcap filtering expression syntax
Win pcap  filtering expression syntaxWin pcap  filtering expression syntax
Win pcap filtering expression syntax
 
postgres loader
postgres loaderpostgres loader
postgres loader
 
Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
Hardware Approaches for Fast Lookup & Classification
Hardware Approaches for Fast Lookup & ClassificationHardware Approaches for Fast Lookup & Classification
Hardware Approaches for Fast Lookup & Classification
 
Understanding TCP and HTTP
Understanding TCP and HTTP Understanding TCP and HTTP
Understanding TCP and HTTP
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 

Kürzlich hochgeladen (20)

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 

Sparser: Faster Parsing of Unstructured Data Formats in Apache Spark with Firas Abuzaid and Shoumik Palkar

  • 1. Sparser: Fast Analytics on Raw Data by Avoiding Parsing Shoumik Palkar, Firas Abuzaid, Peter Bailis, Matei Zaharia
  • 2. Motivation Bigger unstructured datasets and faster hardware. Time Speed Time DataVolume
  • 3. Parsing unstructured data before querying it is often very slow.
  • 4. Today: Spark Data Sources API Spark Core Engine Push part of query into the data source.
  • 5. Today: Spark Data Sources API + E.g., column pruning directly in Parquet data loader - Little support for unstructured formats (e.g., can’t avoid JSON parsing) Spark Core Engine Data Source API Push part of query into the data source.
  • 6. Parsing: A Computational Bottleneck Example: Existing state-of-the- art JSON parsers 100x slower than scanning a string! 113836 56796 360 0 50000 100000 150000 RapidJSON Parsing Mison Index Contruction String Scan Cycles/5KBRecord State-of-the-art Parsers *Similar results on binary formats like Avro and Parquet Parsing seems to be a necessary evil: how do we get around doing it?
  • 7. Key Opportunity: High Selectivity High selectivity especially true for exploratory analytics. 0 0.2 0.4 0.6 0.8 1 1.E-09 1.E-05 1.E-01 CDF Selectivity Databricks Censys 40% of customer Spark queries at Databricks select < 20% of data 99% of queries in Censys select < 0.001% of data Data Source API provides access to query filters at data source!
  • 8. How can we exploit high selectivity to accelerate parsing?
  • 9. Sparser: Filter Before You Parse Raw DataFilter Raw DataFilter Raw DataFilter Raw DataParse Raw DataParse Today: parse full input à slow! Sparser: Filter before parsing first using fast filtering functions with false positives, but no false negatives
  • 10. Demo Count Tweets where text contains “Trump” and “Putin”
  • 11. Sparser in Spark SQL Spark Core Engine Data Source API val df = spark.read.format(“json”)
  • 12. Sparser in Spark SQL Spark Core Engine Data Source API Sparser Filtering Engine val df = spark.read.format(“edu.stanford.sparser.json”) Sparser Data Source Reader (Also supports Avro, Parquet!)
  • 14. Sparser Overview Raw Filter (“RF”): filtering function on a bytestream with false positives but no false negatives. Use an optimizer to combine RFs into a cascade. But is it in the “text” field? Is the full word “Trump”? 0x54 0x72 0x75 0x6d 0x70 T r u m b RF “TRUM” Pass! Other RFs Example: Tweets WHERE text contains “Trump” Full Parser
  • 15. Key Challenges in Using RFs 1. How do we implement the RFs efficiently? 2. How do we efficiently choose the RFs to maximize parsing throughput for a given query and dataset? Rest of this Talk • Sparser’s API • Sparser’s Raw Filter Designs (Challenge 1) • Sparser’s Optimizer: Choosing a Cascade (Challenge 2) • Performance Results on various file formats (e.g., JSON, Avro)
  • 17. Sparser API Filter Example Exact String Match WHERE user.name = “Trump” WHERE likes = 5000
  • 18. Sparser API Filter Example Exact String Match WHERE user.name = “Trump” WHERE likes = 5000 Contains String WHERE text contains “Trum”
  • 19. Sparser API Filter Example Exact String Match WHERE user.name = “Trump” WHERE likes = 5000 Contains String WHERE text contains “Trum” Contains Key WHERE user.url != NULL
  • 20. Sparser API Filter Example Exact String Match WHERE user.name = “Trump” WHERE likes = 5000 Contains String WHERE text contains “Trum” Contains Key WHERE user.url != NULL Conjunctions WHERE user.name = “Trump” AND user.verified = true Disjunctions WHERE user.name = “Trump” OR user.name = “Obama”
  • 21. Sparser API Filter Example Exact String Match WHERE user.name = “Trump” WHERE likes = 5000 Contains String WHERE text contains “Trum” Contains Key WHERE user.url != NULL Conjunctions WHERE user.name = “Trump” AND user.verified = true Disjunctions WHERE user.name = “Trump” OR user.name = “Obama” Currently does not support numerical range-based predicates.
  • 23. Raw Filters in Sparser SIMD-based filtering functions that pass or discard a record by inspecting raw bytestream 113836 56796 360 0 50000 100000 150000 RapidJSON Parsing Mison Index Contruction Raw Filter Cycles/5KBRecord
  • 24. Example RF: Substring Search Search for a small (e.g., 2, 4, or 8-byte) substring of a query predicate in parallel using SIMD On modern CPUs: compare 32 characters in parallel in ~4 cycles (2ns). Length 4 Substring packed in SIMD register Input : “ t e x t ” : “ I j u s t m e t M r . T r u m b ! ! ! ” Shift 1 : T r u m T r u m T r u m T r u m T r u m T r u m T r u m - - - - - - - Shift 2 : - T r u m T r u m T r u m T r u m T r u m T r u m T r u m - - - - - - Shift 3 : - - T r u m T r u m T r u m T r u m T r u m T r u m T r u m - - - - - Shift 4 : - - - T r u m T r u m T r u m T r u m T r u m T r u m T r u m - - - - Example query: text contains “Trump” False positives (found “Trumb” by accident), but no false negatives (No “Trum” ⇒ No “Trump”) Other RFs also possible! Sparser selects them agnostic of implementation.
  • 25. Key-Value Search RF Searches for key, and if key is found, searches for value until some stopping point. Searches occur with SIMD. Only applicable for exact matches Useful for queries with common substrings (e.g., favorited=true) Key: name Value: Trump Delimiter: , Ex 1: “name”: “Trump”, Ex 2: “name”: “Actually, Trump” Ex 3: “name”: “My name is Trump” (Pass) (Fail) (Pass, False positive) Second Example would result in false negative if we allow substring matches Other RFs also possible! Sparser selects them agnostic of implementation.
  • 27. Choosing RFs To decrease false positive rate, combine RFs into a cascade Sparser uses an optimizer to choose a cascade Raw DataFilter Raw DataFilter Raw DataFilter Raw DataParse Raw DataFilter Raw DataParse Raw DataFilter Raw DataParse Raw DataFilter Raw DataFilter vs. vs.
  • 28. Sparser’s Optimizer Step 2: Measure Params on Sample of Records raw bytestream 0010101000110101 RF 1 RF fails RF 2 X X RF fails Filtered bytes sent to full parser Step 4: Apply Chosen Cascade for sampled records: 1 = passed 0 = failed Step 3: Score and Choose Cascade … (name = "Trump" AND text contains "Putin") Step 1: Compile Possible RFs from predicates RF1: "Trump" RF2: "Trum" RF3: "rump" RF4: "Tr" … RF5: "Putin" C (RF1) = 4 C (RF1àRF2) = 1 C (RF2àRF3) = 6 C (RF1àRF3) = 9 … S1 S2 S3 RF 1 0 1 1 RF 2 1 0 1
  • 29. Sparser’s Optimizer: Configuring RF Cascades Three high level steps: 1. Convert a query predicate to a set of RFs 2. Measure the passthrough rate and runtime of each RF and the runtime of the parser on sample of records 3. Minimize optimization function to find min-cost cascade
  • 30. 1. Converting Predicates to RFs Running Example (name = “Trump” AND text contains “Putin”) OR name = “Obama” Possible RFs* Substring Search: Trum, Tr, ru, …, Puti, utin, Pu, ut, …, Obam, Ob, … * Translation from predicate to RF is format-dependent: examples are for JSON data
  • 31. 2. Measure RFs and Parser Evaluate RFs and parser on a sample of records. *Passthrough rates of RFs are non-independent! But too expensive to measure rate of each cascade (combinatorial search space) RF RF Parser Passthrough rates* and runtimes
  • 32. 2. Measure RFs and Parser Passthrough rates of RFs are non-independent! But too expensive to rate of each cascade (combinatorial search space) Solution: Store rates as bitmaps. Pr[a] ∝ 1s in Bitmap of RF a: 0011010101 Pr[b] ∝ 1s in Bitmap of RF b: 0001101101 Pr[a,b] ∝ 1s in Bitwise-& of a, b: 0001000101 foreach sampled record i: foreach RF j: BitMap[i][j] = 1 if RF j passes on i
  • 33. 3. Choosing Best Cascade Running Example (name = “Trump” AND text contains “Putin”) OR name = “Obama” P Trum utinObam PD Obam PD 1. Valid cascade 2. Valid cascade utin Obam PD P 3. Invalid cascade (utin must consider Obam when it fails) Trum utinObam PD PD Pass (right branch)Fail (left branch)Parse PDiscard D Cascade cost computed by traversing tree from root to leaf. Each RF has a runtime/passthrough rate we measured in previous step.
  • 34. Probability and cost of executing full parser Probability and cost of executing RF i Cost of Cascade with RFs 0,1,…,R 3. Choosing Best Cascade 𝐶" = $ Pr [ 𝑒𝑥𝑒𝑐𝑢𝑡𝑒.] .∈" ×𝑐. + Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒45678 ×𝑐45678 P Trum utinObam PD Obam PD Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒96:; = 1 Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒:=.> = Pr [Trum] Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒BC5; = Pr [¬Trum] + Pr[Trum,¬utin] Pr 𝑒𝑥𝑒𝑐𝑢𝑡𝑒45678 = Pr [¬Trum,Obam] + Pr[Trum,utin] + Pr[Trum,¬utin,Obam] Choose cascade with minimum cost: Sparser considers up to min(32, # ORs) RFs and considers up to depth of 4 cascades.
  • 36. Sparser in Spark SQL Spark Core Engine Data Source API Sparser Filtering Engine
  • 37. Sparser in Spark SQL Spark Core Engine Data Source API Sparser Filtering Engine SQL Query to Candidate Raw Filters Native HDFS File Reader Sparser Optimizer + Filtering Raw Data DataFrame
  • 39. Evaluation Setup Datasets Censys - Internet port scan data used in network security Tweets - collected using Twitter Stream API Distributed experiments on 20-node cluster with 4 Broadwell vCPUs/26GB of memory, locally attached SSDs (Google Compute Engine) Single node experiments on Intel Xeon E5-2690 v4 with 512GB memory
  • 40. Queries Twitter queries from other academic work. Censys queries sampled randomly from top-3000 queries. PCAP and Bro queries from online tutorials/blogs.
  • 41. Results: Accelerating End-to-End Spark Jobs 0 200 400 600 Disk Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Runtime(seconds) Spark Spark + Sparser Query Only Censys queries on 652GB of JSON data: up to 4x speedup by using Sparser. 1.1 1.3 0.7 1.4 0 20 40 60 80 Disk Q1 Q2 Q3 Q4 Runtime (seconds) Spark Spark + Sparser Query Only Twitter queries on 68GB of JSON data: up to 9x speedup by using Sparser.
  • 42. Results: Accelerating Existing JSON Parsers 1 10 100 1 2 3 4 5 6 7 8 9 Runtime(s,log10) Sparser + RapidJSON Sparser + Mison RapidJSON Mison Censys queries compared against two state-of-the-art parsers (Mison based on SIMD): Sparser accelerates them by up to 22x.
  • 43. Results: Accelerating Binary Format Parsing 0 0.5 1 1.5 Q1 Q2 Q3 Q4 Runtime (seconds) avro-c Sparser + avro-c 0 0.2 Q1 Q2 Q3 Q4 Runtime (seconds) parquet-cpp Sparser + parquet-cpp Sparser accelerates Avro (above) and Parquet (below) based queries by up to 5x and 4.3x respectively.
  • 44. Results: Domain-Specific Tasks 0 1 2 3 4 0.01 0.1 1 10 100 1000 P1 P2 P3 P4 Speedupover libpcap Runtime(s,log10) Sparser libpcap Tshark tcpdump Sparser accelerates packet parsing by up to 3.5x compared to standard tools.
  • 45. Results: Sparser’s Sensitivity to Selectivity 0 20 40 60 80 100 120 0 0.2 0.4 0.6 0.8 1 Runtime(seconds) Selectivity RapidJSON Sparser + RapidJSON Mison Sparser + Mison Sparser’s performance degrades gracefully as the selectivity of the query decreases.
  • 46. Results: Nuts and Bolts 1 10 100 0 200 400 Runtime(s,log10) Cascade Sparser picks a cascade within 5% of the optimal one using its sampling-based measurement and optimizer. Picked by Sparser 100 1000 10000 0 100 200 ParseThroughput (MB/s) File Offset (100s of MB) With Resampling No Resampling Resampling to calibrate the cascade can improve end-to-end parsing time by up to 20x.
  • 47. Conclusion Sparser: • Uses raw filters to filter before parsing • Selects a cascade of raw filters using an efficient optimizer • Delivers up to 22x speedups over existing parsers • Will be open source soon! shoumik@cs.stanford.edu fabuzaid@cs.stanford.edu Questions or Comments? Contact Us!