Can the elephants handle the no sql onslaught

Aung Thu Rha Hein

Presentation of a decent paper from the Jim Gray Systems Lab

CAN THE ELEPHANTS HANDLE
THE NO-SQL ONSLAUGHT?

AUNG THU RHA HEIN
G5537871

OUTLINE
• Introduction
• Background
• Evaluation
  ◦ Traditional DSS Workload: Hive vs PDW
  ◦ Modern OLTP Workload: MongoDB vs SQL Server
• Discussion & Conclusion

INTRODUCTION

• Motivation
How do the performance and scalability of RDBMS solutions compare
to those of NoSQL systems?
• Proposition
Compare MongoDB (AS/CS) with SQL Server, and Hive with SQL Server PDW,
and analyze performance and scalability on two workloads
(decision support analysis and interactive data serving).
• Use the TPC-H and YCSB benchmarks, respectively

BACKGROUND
• Parallel Data Warehouse (PDW)
  ◦ A shared-nothing parallel database system built on top of SQL Server
  ◦ Consists of multiple compute nodes, a single control node, and other
    administrative service nodes
• Hive
  ◦ An open-source data warehouse built on top of Hadoop
  ◦ Provides a structured data model for data stored in the Hadoop
    Distributed File System (HDFS), and a SQL-like declarative query
    language called HiveQL

BACKGROUND (CONT.)
• MongoDB
Features

  ◦ A document-oriented storage layer, indexing in the form of B-trees,
    auto-sharding, and asynchronous replication of data between servers
  ◦ Data is stored in collections, which contain documents
  ◦ Each document is serialized using BSON

Two MongoDB configurations were evaluated:
  ◦ MongoDB-CS (client-side sharding)
  ◦ MongoDB-AS (auto-sharding)
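
Client-side sharding, as in the MongoDB-CS configuration, means the client itself decides which server holds a given record. The following is a minimal sketch of that idea, assuming hash partitioning on the record key. The slides do not state which partitioning function was used, and the server names and the `shard_for_key` and `route` helpers are hypothetical.

```python
import hashlib

# Hypothetical server addresses, for illustration only.
SERVERS = [f"server-{i}:27017" for i in range(8)]


def shard_for_key(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash.

    Assumption: hash partitioning. Python's built-in hash() is randomized
    per process for strings, so a fixed digest is used instead.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(key: str) -> str:
    """Return the server that the client would contact for this key."""
    return SERVERS[shard_for_key(key, len(SERVERS))]
```

Because the mapping depends only on the key and the number of servers, every client computes the same route without a central router. With auto-sharding (MongoDB-AS), that decision is made by the database instead.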

EVALUATION
• Set up the hardware and software configuration for all four systems
• PDW and Hive use 8 disks to store the data
• For the YCSB benchmark, 8 nodes are used as servers and another 8
  run the benchmark clients
Hive and Hadoop
• Use the RCFile format to store data
• All TPC-H tables are stored in Gzip-compressed RCFile format

TRADITIONAL DSS WORKLOAD:
HIVE VS PDW
Workload Description
• Uses TPC-H at 4 scale factors (250, 1000, 4000, and 16000 GB)
• The TPC-H generator does not produce correct results at the 16000 scale factor
• All 22 TPC-H queries were executed
• The 2 TPC-H refresh functions were left out

TRADITIONAL DSS WORKLOAD:
HIVE VS PDW
Data Layout in
Hive and PDW

TRADITIONAL DSS WORKLOAD:
HIVE VS PDW
Data Preparation and Load Times
Hive
• Dataset generated across 16 nodes
• One Hive table created for each TPC-H table
• Data is loaded in 2 phases:
  ◦ data files are loaded onto each node
  ◦ data is converted from text to RCFile format
PDW
• Load data onto the landing node
• Create the necessary tables

TRADITIONAL DSS WORKLOAD:
HIVE VS PDW
Performance Analysis

TRADITIONAL DSS WORKLOAD:
HIVE VS PDW
Performance Analysis (cont.)
• PDW is faster than Hive for all TPC-H queries
• The average speedup of PDW over Hive is greater for small datasets
• Hive has high overheads for small datasets

Scalability Analysis
• Hive scales better than PDW
• Hive scales well as the dataset size increases
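
The two findings above (PDW's speedup is larger on small datasets, yet Hive scales better) can hold at the same time. The sketch below shows the arithmetic behind both metrics. The timings are made-up placeholders chosen only to illustrate the pattern; they are not measurements from the paper, and the function names are hypothetical.

```python
def speedup(hive_seconds: float, pdw_seconds: float) -> float:
    """How many times faster PDW is than Hive on the same query."""
    return hive_seconds / pdw_seconds


def scaling_factor(time_small: float, time_large: float) -> float:
    """Growth in run time when moving from the smaller to the larger dataset."""
    return time_large / time_small


# HYPOTHETICAL timings in seconds, keyed by scale factor in GB.
# These numbers are invented for illustration, not taken from the paper.
hive = {250: 400.0, 1000: 1000.0}
pdw = {250: 20.0, 1000: 100.0}

speedup_small = speedup(hive[250], pdw[250])      # 20.0
speedup_large = speedup(hive[1000], pdw[1000])    # 10.0
hive_growth = scaling_factor(hive[250], hive[1000])  # 2.5 for 4x the data
pdw_growth = scaling_factor(pdw[250], pdw[1000])     # 5.0 for 4x the data
```

In this invented example Hive's run time grows by less than the 4x increase in data, which is what a large fixed per-query overhead would produce. PDW stays faster in absolute terms at both sizes, but its advantage shrinks as the data grows.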

MODERN OLTP WORKLOAD:
MONGODB VS SQL SERVER
Workload description

Extends YCSB in 2 ways:
• Added support for multiple instances on many database servers
• Added support for stored procedures in the YCSB JDBC driver
The YCSB benchmark was run on a database of 640 million records

MODERN OLTP WORKLOAD:
MONGODB VS SQL SERVER
Data Preparation
• Mongo-AS automatically manages the shards by using a
  “balancer” process
• The loading times for SQL-CS (SQL Server with client-side sharding)
  and Mongo-CS were 146 and 45 minutes, respectively
• The SQL load takes longer because a bulk insert method was not
  used
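
To illustrate the difference between record-at-a-time loading and a batched load, here is a small sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in chosen so the example is self-contained. It is not SQL Server, this is not SQL Server's bulk-load facility, and the slides do not describe the actual load procedure. The column names and helper functions are assumptions.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a simple key/value table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE usertable (ycsb_key TEXT PRIMARY KEY, field0 TEXT)"
    )
    return conn


def load_row_at_a_time(conn: sqlite3.Connection, rows) -> None:
    """Issue one INSERT and one commit per record."""
    for key, value in rows:
        conn.execute("INSERT INTO usertable VALUES (?, ?)", (key, value))
        conn.commit()


def load_batched(conn: sqlite3.Connection, rows) -> None:
    """Insert all records inside a single transaction."""
    with conn:  # commits once when the block exits without error
        conn.executemany("INSERT INTO usertable VALUES (?, ?)", rows)


rows = [(f"user{i}", f"value{i}") for i in range(1000)]
db_a, db_b = make_db(), make_db()
load_row_at_a_time(db_a, rows)
load_batched(db_b, rows)
count_a = db_a.execute("SELECT COUNT(*) FROM usertable").fetchone()[0]
count_b = db_b.execute("SELECT COUNT(*) FROM usertable").fetchone()[0]
```

Both functions leave the table in the same state. The batched version pays the per-transaction cost once instead of once per record, which is the general reason a load without a bulk method tends to be slower.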

MODERN OLTP WORKLOAD:
MONGODB VS SQL SERVER
Experimental Evaluation

“Read-Only” workload

MODERN OLTP WORKLOAD:
MONGODB VS SQL SERVER

95% Read
5% Update Workload

MODERN OLTP WORKLOAD:
MONGODB VS SQL SERVER

50% Read &
50% Update workload

MODERN OLTP WORKLOAD:
MONGODB VS SQL SERVER

95% Read
5% Append Workload
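
The four workload mixes named on the preceding slides can be written down as operation probabilities. The sketch below is a hypothetical Python illustration of how a benchmark client might pick its next operation from such a mix. It is not YCSB's own code, and the workload names and helper functions are assumptions.

```python
import random

# Operation mixes as listed on the slides; the dictionary keys are made-up labels.
WORKLOADS = {
    "read_only": {"read": 1.00},
    "read_95_update_5": {"read": 0.95, "update": 0.05},
    "read_50_update_50": {"read": 0.50, "update": 0.50},
    "read_95_append_5": {"read": 0.95, "append": 0.05},
}


def next_operation(mix: dict, rng: random.Random) -> str:
    """Pick one operation name according to the mix's probabilities."""
    ops = list(mix)
    weights = [mix[op] for op in ops]
    return rng.choices(ops, weights=weights, k=1)[0]


def sample_counts(name: str, n: int, seed: int = 0) -> dict:
    """Draw n operations from the named workload and count each kind."""
    rng = random.Random(seed)
    counts = {op: 0 for op in WORKLOADS[name]}
    for _ in range(n):
        counts[next_operation(WORKLOADS[name], rng)] += 1
    return counts
```

For example, `sample_counts("read_95_update_5", 10000)` returns a count of reads and updates in roughly a 95:5 ratio, with the exact split depending on the seed.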

DISCUSSION & CONCLUSION
• This evaluation shows that NoSQL systems still trail RDBMSs in
  performance
• PDW is 9 times faster than Hive when running TPC-H at the 16 TB scale
• SQL-CS achieved higher throughput than MongoDB

AUTHORS
• Avrilia Floratou
  University of Wisconsin-Madison
• Nikhil Teletia
  Microsoft Jim Gray Systems Lab
• David J. DeWitt
  Microsoft Jim Gray Systems Lab
• Jignesh M. Patel
  University of Wisconsin-Madison
• Donghui Zhang
  Paradigm4
