Sempala
Interactive SPARQL Query Processing
on Hadoop
Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen
University of Freiburg, Germany
ISWC 2014 - Riva del Garda, Italy
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 2
Motivation
▪ Semantic Web has arrived in real-world applications (not only academia & research)
• Web-scale semantic data makes single-machine solutions infeasible
▪ Hadoop ecosystem has become the de facto standard for Big Data applications
▪ Our idea: use it for Semantic Web purposes as well
• 2 main reasons:
1) Web-scale semantic data requires solutions that scale out
2) Industry has settled on Hadoop (or Hadoop-style) architectures
→ superior cost-benefit ratio compared to specialized infrastructures
Motivation
▪ SPARQL-on-Hadoop
• Existing solutions are mostly based on MapReduce
• Scale very well to billions of triples (or more)
• But MapReduce is batch-oriented and thus inherently slow
• Good for unselective ETL-like queries with many results
▪ SPARQL queries are often explorative and ad hoc
• Typically selective, thus returning only a few results
▪ Runtimes on the order of hours are not satisfactory!
• But interactive runtimes are virtually impossible to achieve with MapReduce
▪ Need for interactive SPARQL-on-Hadoop
• Following the trend of interactive SQL-on-Hadoop
Sempala
▪ SPARQL-on-Hadoop query engine
• Designed specifically with ad hoc, selective queries in mind
• Interactive query runtimes for such queries
• Built on top of Impala, an MPP SQL query engine for Hadoop
• RDF data stored in HDFS using a columnar storage format (Parquet)
Sempala – Data Layout
▪ Single unified property table
• Parquet is a column-oriented storage format
→ optimized for wide tables where queries select only a few columns
• Sparse columns cause little to no storage overhead in Parquet
• No clustering, no joins for star pattern queries
• Duplication strategy for multi-valued properties
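The unified property table can be illustrated with a small Python sketch. The slide does not spell out the duplication strategy, so this sketch assumes the simplest variant: one row per combination of values, with single-valued properties repeated in every duplicated row. The function name `build_property_table` and the sample data are hypothetical and not taken from Sempala.

```python
from collections import defaultdict
from itertools import product


def build_property_table(triples):
    """Pivot (subject, predicate, object) triples into one wide table.

    Assumption: multi-valued properties are handled by emitting one row
    per combination of values (an illustrative duplication strategy).
    """
    by_subject = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        by_subject[s][p].append(o)

    # One column per distinct predicate in the data set.
    columns = sorted({p for _, p, _ in triples})
    rows = []
    for s in sorted(by_subject):
        props = by_subject[s]
        # None marks a property the subject does not have (sparse column).
        value_lists = [props.get(c, [None]) for c in columns]
        for combo in product(*value_lists):
            rows.append((s, *combo))
    return ["subject", *columns], rows


triples = [
    ("Article1", "title", "T1"),
    ("Article1", "author", "A"),
    ("Article1", "author", "B"),
    ("Article2", "title", "T2"),
]
header, rows = build_property_table(triples)
for row in rows:
    print(row)
```

A star-shaped query that touches only one subject can then be answered from a single row of this table, which is why no joins are needed for such patterns.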
[Figure: example property table, with the annotations "(Article2, 2)" and "run-length encoding"]
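Why sparse or duplicated columns cost little space can be illustrated with plain run-length encoding. This is a simplified sketch, not Parquet's actual on-disk encoding; `rle_encode` and `rle_decode` are hypothetical helper names.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# A sparse column: mostly None, with a single populated cell.
column = [None] * 5 + ["x"] + [None] * 4
runs = rle_encode(column)
print(runs)
```

Ten cells shrink to three runs here; long stretches of NULLs or repeated values in a wide table compress in the same way.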
Sempala – Query Compiler
▪ BGP processing in Sempala
• Decompose the BGP into disjoint triple groups, each sharing the same subject
• Use a subquery for every triple group and join the results (join group)
[Figure: example BGP split into triple groups tg1 and tg2, whose results are joined in join group jg1]
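The two steps above can be sketched in Python. This is a minimal illustration, not Sempala's compiler: it assumes a hypothetical table `proptable` with a `subject` column and one column per predicate, constant predicates that are valid column names, and naive string quoting without escaping.

```python
def is_var(term):
    return term.startswith("?")


def triple_groups(bgp):
    """Split a BGP into disjoint groups of patterns sharing a subject."""
    groups = {}
    for s, p, o in bgp:
        groups.setdefault(s, []).append((s, p, o))
    return list(groups.values())


def group_to_subquery(group, table="proptable"):
    """Build one SELECT per triple group; no join is needed inside a group."""
    select, where, bound = [], [], {}

    def bind(term, column):
        if not is_var(term):
            where.append(f"{column} = '{term}'")       # constant term
        elif term in bound:
            where.append(f"{column} = {bound[term]}")  # repeated variable
        else:
            bound[term] = column
            select.append(f"{column} AS {term[1:]}")
            if column != "subject":
                where.append(f"{column} IS NOT NULL")

    bind(group[0][0], "subject")
    for _, p, o in group:
        bind(o, p)
    sql = f"SELECT {', '.join(select)} FROM {table} WHERE {' AND '.join(where)}"
    return sql, [v[1:] for v in bound]


def compile_bgp(bgp, table="proptable"):
    """Join the triple-group subqueries on their shared variables."""
    owner = {}  # variable name -> alias of the first subquery exposing it
    from_clause = ""
    for i, group in enumerate(triple_groups(bgp), start=1):
        alias = f"tg{i}"
        sql, names = group_to_subquery(group, table)
        shared = [n for n in names if n in owner]
        if i == 1:
            from_clause = f"({sql}) {alias}"
        elif shared:
            cond = " AND ".join(f"{alias}.{n} = {owner[n]}.{n}" for n in shared)
            from_clause += f" JOIN ({sql}) {alias} ON {cond}"
        else:
            from_clause += f" CROSS JOIN ({sql}) {alias}"
        for n in names:
            owner.setdefault(n, alias)
    projection = ", ".join(f"{a}.{n}" for n, a in owner.items())
    return f"SELECT {projection} FROM {from_clause}"


bgp = [("?x", "author", "?y"), ("?x", "title", "?t"), ("?y", "name", "Paul")]
print(compile_bgp(bgp))
```

The sketch assumes every triple group binds at least one variable; a group made up only of constants would need an existence check instead of a projection.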
Sempala – Query Compiler
▪ Overall workflow from SPARQL to Impala SQL
Evaluation
▪ Small cluster with low-end configuration
• 10 machines, each with 32 GB RAM and 2 disks; Gigabit network
(Cloudera recommends 256 GB RAM and 12 disks or more)
• CDH 4.5, Impala 1.2.3
▪ LUBM and BSBM benchmarks
• LUBM up to 3K universities
• BSBM up to 3M products
▪ Compared Sempala with 4 other Hadoop-based systems
• Hive (same query compiler as Sempala, but with Hive as execution engine)
• PigSPARQL (built on top of Apache Pig) [1]
• MapMerge (map-side merge join for SPARQL BGPs) [2]
• MAPSIN (map-side index nested loop join based on Apache HBase) [3]
Evaluation
▪ LUBM 3K (runtimes in seconds, charts in log scale)
▪ Geometric mean for selective (star-shaped) queries:
• Sempala (8.3), Hive (89), PigSPARQL (69.7), MAPSIN (78.1), MapMerge (65.8)
▪ Geometric mean for more sophisticated queries:
• Sempala (20.7), Hive (316.6), PigSPARQL (266), MAPSIN (119.5), MapMerge (175.6)
▪ Geometric mean for unselective queries:
• Sempala (63.5), Hive (166), PigSPARQL (52.4), MAPSIN (101.4), MapMerge (24.5)
• Runtime of Sempala dominated by storing millions of records in result table
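For readers who want to relate the reported figures to each other, the following Python sketch shows how a geometric mean is computed and derives speedup factors from the values quoted for selective queries. The helper `geometric_mean` and the dictionary layout are illustrative; only the numbers come from the slides.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logarithms for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Geometric means (seconds) reported for selective queries on LUBM 3K.
selective = {
    "Sempala": 8.3,
    "Hive": 89.0,
    "PigSPARQL": 69.7,
    "MAPSIN": 78.1,
    "MapMerge": 65.8,
}

# Speedup of Sempala relative to each other system.
speedups = {
    name: runtime / selective["Sempala"]
    for name, runtime in selective.items()
    if name != "Sempala"
}
for name, factor in sorted(speedups.items()):
    print(f"{name}: {factor:.1f}x")
```

The geometric mean is the usual choice for averaging runtimes that span several orders of magnitude, since a single very slow query does not dominate it the way it would an arithmetic mean.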
Summary
▪ Sempala: SPARQL query engine for Hadoop
• Built on top of a state-of-the-art MPP SQL query engine (Impala)
• Uses a state-of-the-art columnar storage format (Parquet)
• SPARQL 1.0 (not only BGPs)
▪ Evaluation
• Outperforms existing Hadoop-based systems by about an order of magnitude on selective queries
• Interactive runtimes for selective queries (selective ≠ simple)
(order of seconds, not minutes or even hours)
▪ Future Work
• Refinement of data layout (nested data structures → Impala 2.1)
• SPARQL 1.1 features (e.g. subqueries, aggregations)
References
[1] Schätzle, A. et al.: PigSPARQL: Mapping SPARQL to Pig Latin. In: SWIM 2011, pp. 4:1-4:8
[2] Przyjaciel-Zablocki, M. et al.: Map-Side Merge Joins for Scalable SPARQL BGP Processing. In: CloudCom 2013, pp. 631-638
[3] Schätzle, A. et al.: Cascading Map-Side Joins over HBase for Scalable Join Processing. In: SSWS+HPCSW 2012, pp. 59-74

Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
 
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
Pirangut | Call Girls Pune Phone No 8005736733 Elite Escort Service Available...
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
 
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
 

Sempala - Interactive SPARQL Query Processing on Hadoop

  • 1. Sempala: Interactive SPARQL Query Processing on Hadoop. Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen (University of Freiburg, Germany). ISWC 2014, Riva del Garda, Italy.
  • 2. Motivation
      - The Semantic Web has arrived in real-world applications (not only academia and research)
          - Web-scale semantic data makes single-machine solutions infeasible
      - The Hadoop ecosystem has become the de facto standard for Big Data applications
      - Our idea: use it for Semantic Web purposes as well, for 2 main reasons:
          1) Web-scale semantic data requires solutions that scale out
          2) Industry has settled on Hadoop (or Hadoop-style) architectures
             → superior cost-benefit ratio compared to specialized infrastructures
      22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop
  • 3. Motivation
      - SPARQL-on-Hadoop
          - Existing solutions are mostly based on MapReduce
          - They scale very well to billions of triples (or more)
          - But MapReduce is batch-oriented and thus inherently slow
          - Good for unselective, ETL-like queries with many results
      - SPARQL queries are often explorative and ad hoc
          - Typically selective, thus returning only a few results
      - Runtimes in the order of hours are not satisfying!
          - But interactive runtimes are virtually impossible to achieve with MapReduce
      - Need for interactive SPARQL-on-Hadoop
          - Following the trend of interactive SQL-on-Hadoop
  • 4. Sempala
      - SPARQL-on-Hadoop query engine
          - Designed especially with ad hoc, selective queries in mind
          - Interactive query runtimes for such queries
          - Built on top of Impala, an MPP SQL query engine for Hadoop
          - RDF data stored in HDFS using a columnar storage format (Parquet)
  • 6. Sempala – Data Layout
      - Single unified property table
          - Parquet is a column-oriented storage format
            → optimized for wide tables where a query selects only a few columns
          - Sparse columns cause little to no storage overhead in Parquet
          - No clustering, no joins for star pattern queries
          - Duplication strategy for multi-valued properties
      [Figure annotation: (Article2, 2), run-length encoding]
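
The layout on this slide can be illustrated with a small sketch. It is not Sempala's loader: the function name `build_property_table`, the sample triples, and the short predicate names are hypothetical, and the handling of multi-valued properties is one plausible reading of "duplication strategy" (one row per value, with the subject's other values repeated; several multi-valued properties are combined as a Cartesian product, which is an assumption).

```python
from itertools import product


def build_property_table(triples):
    """Pivot (subject, predicate, object) triples into one wide table.

    Each predicate becomes a column. None stands for NULL, i.e. the
    subject has no value for that property (a sparse cell).
    """
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {}).setdefault(p, []).append(o)

    columns = sorted({p for _, p, _ in triples})
    rows = []
    for s, props in by_subject.items():
        value_lists = [props.get(p, [None]) for p in columns]
        # Assumed duplication strategy: one row per combination of values,
        # so single-valued properties are repeated in every duplicated row.
        for combo in product(*value_lists):
            rows.append((s, *combo))
    return ["subject", *columns], rows


# Hypothetical sample data; Article2 has two authors.
triples = [
    ("Article1", "title", "Title 1"),
    ("Article1", "author", "Alice"),
    ("Article2", "title", "Title 2"),
    ("Article2", "author", "Alice"),
    ("Article2", "author", "Bob"),
    ("Article2", "cites", "Article1"),
]

header, rows = build_property_table(triples)
print(header)
for row in rows:
    print(row)
```

In this sketch Article2 occupies two rows that differ only in the `author` column. Repeated values of that kind are what a run-length encoding such as the one noted in the figure annotation can store compactly.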
  • 11. Sempala – Query Compiler
      - BGP processing in Sempala
          - Decompose the BGP into disjoint triple groups, each containing the triple patterns that share the same subject
          - Use a subquery for every triple group and join the results (join group)
      [Figure labels: triple group tg1, triple group tg2, join group jg1]
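
The two steps on this slide can be sketched as follows. This is a simplified illustration, not the Sempala query compiler: the table name `property_table`, the column names, the example BGP, and all function names are hypothetical. It assumes every predicate is a constant, each variable occurs at most once per triple group, every group binds at least one variable, and constants are quoted naively.

```python
def is_var(term):
    return term.startswith("?")


def triple_groups(bgp):
    """Group triple patterns by subject (insertion order is preserved)."""
    groups = {}
    for s, p, o in bgp:
        groups.setdefault(s, []).append((p, o))
    return groups


def group_to_subquery(subject, patterns, table="property_table"):
    """Turn one triple group into a single-table subquery (no joins)."""
    select, where, variables = [], [], []
    if is_var(subject):
        select.append(f"subject AS {subject[1:]}")
        variables.append(subject[1:])
    else:
        where.append(f"subject = '{subject}'")
    for p, o in patterns:
        if is_var(o):
            select.append(f"{p} AS {o[1:]}")
            where.append(f"{p} IS NOT NULL")
            variables.append(o[1:])
        else:
            where.append(f"{p} = '{o}'")
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, variables


def bgp_to_sql(bgp, table="property_table"):
    """Join the per-group subqueries on their shared variables."""
    owner = {}  # variable -> alias of the first subquery that binds it
    from_parts = []
    for i, (subject, patterns) in enumerate(triple_groups(bgp).items(), 1):
        alias = f"tg{i}"
        sql, variables = group_to_subquery(subject, patterns, table)
        conds = [f"{owner[v]}.{v} = {alias}.{v}" for v in variables if v in owner]
        if not from_parts:
            from_parts.append(f"({sql}) {alias}")
        elif conds:
            from_parts.append(f"JOIN ({sql}) {alias} ON " + " AND ".join(conds))
        else:
            from_parts.append(f"CROSS JOIN ({sql}) {alias}")
        for v in variables:
            owner.setdefault(v, alias)
    projection = ", ".join(f"{a}.{v}" for v, a in owner.items())
    return f"SELECT {projection} FROM " + " ".join(from_parts)


# Hypothetical BGP: two triple groups (?article and ?person).
bgp = [
    ("?article", "title", "?title"),
    ("?article", "author", "?person"),
    ("?person", "name", "?name"),
]
groups = triple_groups(bgp)
sql = bgp_to_sql(bgp)
print(sql)
```

The two patterns with subject `?article` are answered by one subquery over the property table without a join. A join is needed only between the two triple groups, here on the shared variable `?person`.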
  • 14. Sempala – Query Compiler
      - Overall workflow from SPARQL to Impala SQL
  • 15. Evaluation
      - Small cluster with low-end configuration
          - 10 machines, 32 GB RAM and 2 disks each, Gigabit network (Cloudera recommends 256 GB RAM and 12 disks or more)
          - CDH 4.5, Impala 1.2.3
      - LUBM and BSBM benchmarks
          - LUBM up to 3K universities
          - BSBM up to 3M products
      - Compared Sempala with 4 other Hadoop-based systems
          - Hive (same query compiler as Sempala, but Hive as the execution engine)
          - PigSPARQL (built on top of Apache Pig) [1]
          - MapMerge (map-side merge join for SPARQL BGPs) [2]
          - MAPSIN (map-side index nested loop join based on Apache HBase) [3]
  • 16. Evaluation
      - LUBM 3K (runtimes in seconds, log scale)
  • 17. Evaluation
      - LUBM 3K (runtimes in seconds, log scale)
      - Geometric mean for selective (star-shaped) queries:
          - Sempala (8.3), Hive (89), PigSPARQL (69.7), MAPSIN (78.1), MapMerge (65.8)
  • 18. Evaluation
      - LUBM 3K (runtimes in seconds, log scale)
      - Geometric mean for more sophisticated queries:
          - Sempala (20.7), Hive (316.6), PigSPARQL (266), MAPSIN (119.5), MapMerge (175.6)
  • 19. Evaluation
      - LUBM 3K (runtimes in seconds, log scale)
      - Geometric mean for unselective queries:
          - Sempala (63.5), Hive (166), PigSPARQL (52.4), MAPSIN (101.4), MapMerge (24.5)
          - Runtime of Sempala is dominated by storing millions of records in the result table
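
To put the three sets of geometric means side by side, the following sketch computes how many times slower each system is than Sempala per query class. The runtimes are copied from the slides above; the dictionary layout, the category labels, and the function name `slowdown_vs` are illustrative choices, not part of the original evaluation.

```python
# Geometric mean runtimes in seconds on LUBM 3K, as reported on the slides.
geo_means = {
    "selective": {
        "Sempala": 8.3, "Hive": 89.0, "PigSPARQL": 69.7,
        "MAPSIN": 78.1, "MapMerge": 65.8,
    },
    "sophisticated": {
        "Sempala": 20.7, "Hive": 316.6, "PigSPARQL": 266.0,
        "MAPSIN": 119.5, "MapMerge": 175.6,
    },
    "unselective": {
        "Sempala": 63.5, "Hive": 166.0, "PigSPARQL": 52.4,
        "MAPSIN": 101.4, "MapMerge": 24.5,
    },
}


def slowdown_vs(means, baseline="Sempala"):
    """Runtime of each system divided by the baseline runtime.

    Values above 1 mean the system is slower than the baseline,
    values below 1 mean it is faster.
    """
    base = means[baseline]
    return {
        system: round(runtime / base, 1)
        for system, runtime in means.items()
        if system != baseline
    }


ratios = {category: slowdown_vs(means) for category, means in geo_means.items()}
for category, by_system in ratios.items():
    print(category, by_system)
```

For the selective queries the other systems come out roughly 8 to 11 times slower than Sempala. For the unselective queries the ratios for PigSPARQL and MapMerge fall below 1, which matches the slide's remark that Sempala's runtime there is dominated by writing the result table.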
  • 20. Summary
      - Sempala: SPARQL query engine for Hadoop
          - Built on top of a state-of-the-art MPP SQL query engine (Impala)
          - Uses a state-of-the-art columnar storage format (Parquet)
          - Supports SPARQL 1.0 (not only BGPs)
      - Evaluation
          - Outperforms existing Hadoop-based systems by an order of magnitude
          - Interactive runtimes for selective queries (selective ≠ simple): order of seconds, not minutes or even hours
      - Future work
          - Refinement of data layout (nested data structures → Impala 2.1)
          - SPARQL 1.1 features (e.g. subqueries, aggregations)
  • 21. References
      [1] Schätzle, A. et al.: PigSPARQL: Mapping SPARQL to Pig Latin. In: SWIM 2011, pp. 4:1-4:8
      [2] Przyjaciel-Zablocki, M. et al.: Map-Side Merge Joins for Scalable SPARQL BGP Processing. In: CloudCom 2013, pp. 631-638
      [3] Schätzle, A. et al.: Cascading Map-Side Joins over HBase for Scalable Join Processing. In: SSWS+HPCSW 2012, pp. 59-74