SlideShare a Scribd company logo
1 of 11
Download to read offline
Shark attack on SQL-on-
Hadoop
Gerd König
May 27th, 2014
(Big) Data Engineer
Agenda
● SQL-on-Hadoop
○ Why that hype?
○ Tool overview and comparison
○ File formats matters
● Shark
○ Facts & figures
○ What makes the difference?
○ SparkSQL enters the playground
● Hands-On (quick ‘n dirty)
○ File formats & disk usage
○ Execution times (at a rough estimate) / Benchmarking
● Summary
SQL-on-Hadoop - Why that hype?
● Hadoop is widely accepted as “new technology”
● Hadoop gets more and more enterprise ready
● SQL is a well established language for many years
and used by DB developers as well as Business
Analysts
=> Huge demand for SQL(-like) access to Hadoop
SQL-on-Hadoop
A whole bunch of tools (just an excerpt)
SQL-on-Hadoop
Clustering some tools
Shark - Facts & figures
● ...sits on top of Apache Spark
● is tightly coupled with Hive, uses a slightly modified version
● use Hive statements, UDFs and Hive metastore (HCatalog)
● can be run in Shark-shell as well as Shark Server (connect e.g.
via beeline JDBC client)
Shark / SparkSQL
● What makes the difference?
○ Performance increase due to in-memory processing (‘low-latency
M/R’)
○ Interaction with other “Plugins” of the Spark stack, like ML-library,
e.g. call ML functions directly with your SQL resultset:
val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20")
println(youngUsers.count)
val featureMatrix = youngUsers.map(extractFeatures(_))
kmeans(featureMatrix)
● SparkSQL - A new star is born?
○ no dependencies to Hive, new type of RDD “SchemaRDD”
○ fires SQL against RDDs, Parquet files, Hive (via Wrappers)
File format matters
● An appropriate file format influences
○ performance, and
○ used disk space
● Use a columnar storage format for columnar data(bases)
○ RCFile, ORC, Parquet
Hands-On
● Part I
○ compare Parquet based table vs. flat file
● Part II
○ execute 1 query in Hive, Impala and Shark
○ get a feeling about runtime...
Further information
● Detailled Benchmarks by Berkeley AmpLab:
https://amplab.cs.berkeley.edu/benchmark/
● Shark
http://shark.cs.berkeley.edu/
● SparkSQL
https://github.com/apache/spark/tree/master/sql
http://people.apache.org/~pwendell/catalyst-docs/sql-
programming-guide.html
THANKS for your attention !
Gerd König
gerd.koenig@ymc.ch
Tel. +41 (0)71 508 24 74
@gerd_koenig
ch.linkedin.com/in/gerdkoenig
Q&A

More Related Content

What's hot

Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible APIIntroducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
ScyllaDB
 
Xldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseXldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouse
liqiang xu
 
Nosql databases for the .net developer
Nosql databases for the .net developerNosql databases for the .net developer
Nosql databases for the .net developer
Jesus Rodriguez
 

What's hot (20)

Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database Architecture
 
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible APIIntroducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
 
The Do’s and Don’ts of Benchmarking Databases
The Do’s and Don’ts of Benchmarking DatabasesThe Do’s and Don’ts of Benchmarking Databases
The Do’s and Don’ts of Benchmarking Databases
 
NoSQL Slideshare Presentation
NoSQL Slideshare Presentation NoSQL Slideshare Presentation
NoSQL Slideshare Presentation
 
Introducing Scylla Open Source 4.0
Introducing Scylla Open Source 4.0Introducing Scylla Open Source 4.0
Introducing Scylla Open Source 4.0
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Webinar how to build a highly available time series solution with kairos-db (1)
Webinar  how to build a highly available time series solution with kairos-db (1)Webinar  how to build a highly available time series solution with kairos-db (1)
Webinar how to build a highly available time series solution with kairos-db (1)
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»
 
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
 
Xldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseXldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouse
 
Scylla Summit 2018: The Short and Straight Road That Leads from Cassandra to ...
Scylla Summit 2018: The Short and Straight Road That Leads from Cassandra to ...Scylla Summit 2018: The Short and Straight Road That Leads from Cassandra to ...
Scylla Summit 2018: The Short and Straight Road That Leads from Cassandra to ...
 
NoSQL and NewSQL: Tradeoffs between Scalable Performance & Consistency
NoSQL and NewSQL: Tradeoffs between Scalable Performance & ConsistencyNoSQL and NewSQL: Tradeoffs between Scalable Performance & Consistency
NoSQL and NewSQL: Tradeoffs between Scalable Performance & Consistency
 
Nosql databases for the .net developer
Nosql databases for the .net developerNosql databases for the .net developer
Nosql databases for the .net developer
 
Steering the Sea Monster - Integrating Scylla with Kubernetes
Steering the Sea Monster - Integrating Scylla with KubernetesSteering the Sea Monster - Integrating Scylla with Kubernetes
Steering the Sea Monster - Integrating Scylla with Kubernetes
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSql
 
Building a REST API with Cassandra on Datastax Astra Using Python and Node
Building a REST API with Cassandra on Datastax Astra Using Python and NodeBuilding a REST API with Cassandra on Datastax Astra Using Python and Node
Building a REST API with Cassandra on Datastax Astra Using Python and Node
 
Cassandra-vs-MongoDB
Cassandra-vs-MongoDBCassandra-vs-MongoDB
Cassandra-vs-MongoDB
 

Viewers also liked

Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
DataWorks Summit
 
Social network analysis
Social network analysisSocial network analysis
Social network analysis
Stefanie Zhao
 

Viewers also liked (7)

Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
The internals
The internalsThe internals
The internals
 
Social network analysis
Social network analysisSocial network analysis
Social network analysis
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Sharks powerpoint
Sharks powerpointSharks powerpoint
Sharks powerpoint
 

Similar to shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 

Similar to shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014 (20)

Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Spark 101
Spark 101Spark 101
Spark 101
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

shark attack on sql-on-hadoop Talk at BerlinBuzzwords 2014

  • 1. Shark attack on SQL-on- Hadoop Gerd König May 27th, 2014 (Big) Data Engineer
  • 2. Agenda ● SQL-on-Hadoop ○ Why that hype? ○ Tool overview and comparison ○ File formats matters ● Shark ○ Facts & figures ○ What makes the difference? ○ SparkSQL enters the playground ● Hands-On (quick ‘n dirty) ○ File formats & disk usage ○ Execution times (at a rough estimate) / Benchmarking ● Summary
  • 3. SQL-on-Hadoop - Why that hype? ● Hadoop is widely accepted as “new technology” ● Hadoop gets more and more enterprise ready ● SQL is a well established language for many years and used by DB developers as well as Business Analysts => Huge demand for SQL(-like) access to Hadoop
  • 4. SQL-on-Hadoop A whole bunch of tools (just an excerpt)
  • 6. Shark - Facts & figures ● ...sits on top of Apache Spark ● is tightly coupled with Hive, uses a slightly modified version ● use Hive statements, UDFs and Hive metastore (HCatalog) ● can be run in Shark-shell as well as Shark Server (connect e.g. via beeline JDBC client)
  • 7. Shark / SparkSQL ● What makes the difference? ○ Performance increase due to in-memory processing (‘low-latency M/R’) ○ Interaction with other “Plugins” of the Spark stack, like ML-library, e.g. call ML functions directly with your SQL resultset: val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20") println(youngUsers.count) val featureMatrix = youngUsers.map(extractFeatures(_)) kmeans(featureMatrix) ● SparkSQL - A new star is born? ○ no dependencies to Hive, new type of RDD “SchemaRDD” ○ fires SQL against RDDs, Parquet files, Hive (via Wrappers)
  • 8. File format matters ● An appropriate file format influences ○ performance, and ○ used disk space ● Use a columnar storage format for columnar data(bases) ○ RCFile, ORC, Parquet
  • 9. Hands-On ● Part I ○ compare Parquet based table vs. flat file ● Part II ○ execute 1 query in Hive, Impala and Shark ○ get a feeling about runtime...
  • 10. Further information ● Detailled Benchmarks by Berkeley AmpLab: https://amplab.cs.berkeley.edu/benchmark/ ● Shark http://shark.cs.berkeley.edu/ ● SparkSQL https://github.com/apache/spark/tree/master/sql http://people.apache.org/~pwendell/catalyst-docs/sql- programming-guide.html
  • 11. THANKS for your attention ! Gerd König gerd.koenig@ymc.ch Tel. +41 (0)71 508 24 74 @gerd_koenig ch.linkedin.com/in/gerdkoenig Q&A