Can the elephants handle the no sql onslaught
1. CAN THE ELEPHANTS HANDLE
THE NO-SQL ONSLAUGHT?
AUNG THU RHA HEIN
G5537871
2. OUTLINE
Introduction
Background
Evaluation
Traditional DSS Workload: Hive vs PDW
Modern OLTP Workload: MongoDB vs SQL Server
Discussion & Conclusion
3. INTRODUCTION
Motivation
How do the performance and scalability of RDBMS solutions compare
to those of NoSQL systems?
Proposition
Compare MongoDB (AS/CS) with SQL Server and Hive with SQL Server PDW,
and analyze the performance and scalability aspects on two workloads
(interactive data serving and decision support analysis, respectively).
Use the YCSB benchmark for the data-serving workload and the TPC-H benchmark for the DSS workload
4. BACKGROUND
Parallel Data Warehouse (PDW)
shared-nothing parallel database system built on top of SQL
Server
multiple compute nodes, a single control node and other
administrative service nodes.
Hive
an open-source data warehouse built on top of Hadoop
a structured data model for data that is stored in the Hadoop
Distributed Filesystem (HDFS), and a SQL-like declarative query
language called HiveQL
5. BACKGROUND (CONT.)
MongoDB
Features
a document-oriented storage layer, indexing in the form of
B-trees, auto-sharding, and asynchronous replication of data between
servers.
Data stored in collections which contain documents
Each document is serialized using BSON
For the evaluation, two types of MongoDB deployments were set up:
MongoDB-CS (with client-side sharding)
MongoDB-AS (Auto-Sharding)
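To make the client-side sharding idea concrete, here is a minimal sketch in which the client itself decides which server holds each key. The slides do not say how keys are mapped to servers, so the hash-based routing below is an assumption, and `shard_for_key` and `ClientSideShardedStore` are hypothetical names. Plain dictionaries stand in for the 8 server connections.

```python
import hashlib


def shard_for_key(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ClientSideShardedStore:
    """Hypothetical client that routes every request to one shard by key."""

    def __init__(self, num_shards: int):
        # Each dict stands in for a connection to one independent server.
        self.shards = [dict() for _ in range(num_shards)]

    def insert(self, key: str, document: dict) -> None:
        self.shards[shard_for_key(key, len(self.shards))][key] = document

    def read(self, key: str):
        return self.shards[shard_for_key(key, len(self.shards))].get(key)
```

With auto-sharding, by contrast, this routing decision is made by the database rather than by client code.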
6. EVALUATION
Set up the hardware and software configuration for all four systems
For PDW and Hive, use 8 disks to store the data
For YCSB benchmark, 8 nodes are used as servers and another 8 for
client-benchmarks
Hive and Hadoop
Use RCFile format to store data
All TPC-H tables are stored in Gzip-compressed RCFile format
7. TRADITIONAL DSS WORKLOAD:
HIVE VS PDW
Workload Description
Use TPC-H at 4 scale factors (250, 1000, 4000, 16000 GB)
The TPC-H generator doesn’t produce correct results at the 16000 scale factor
Executed all 22 TPC-H queries
But left out the 2 TPC-H refresh functions
9. TRADITIONAL DSS WORKLOAD:
HIVE VS PDW
Data Preparation and Load Times
Hive
Generated dataset across 16 nodes
Created one Hive table for each TPC-H table
Data is loaded in 2 phases:
data files loaded onto each node
data is converted from text to RCFile format
PDW
Load data onto the landing node
Create necessary tables
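The two-phase Hive load described above can be sketched as a small helper that generates HiveQL statements. This is only an illustration, not the exact script used in the evaluation: the helper name `hive_load_statements`, the staging-table naming, the HDFS path, and the choice of compression settings are all assumptions. The `|` delimiter matches the text files produced by the TPC-H generator.

```python
def hive_load_statements(table: str, columns: dict, hdfs_dir: str) -> list:
    """Build HiveQL for a text-to-RCFile load (illustrative, assumed layout)."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    staging = f"{table}_text"  # hypothetical name for the text staging table
    return [
        # Phase 1: expose the raw text files already copied into HDFS.
        f"CREATE EXTERNAL TABLE {staging} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' "
        f"STORED AS TEXTFILE LOCATION '{hdfs_dir}'",
        # Phase 2: rewrite the data into a Gzip-compressed RCFile table.
        f"CREATE TABLE {table} ({cols}) STORED AS RCFILE",
        "SET hive.exec.compress.output=true",
        "SET mapred.output.compression.codec="
        "org.apache.hadoop.io.compress.GzipCodec",
        f"INSERT OVERWRITE TABLE {table} SELECT * FROM {staging}",
    ]
```

Calling it with a table name such as `nation`, a column-to-type mapping, and an HDFS directory returns five statements that could be passed to a Hive client one by one.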
11. TRADITIONAL DSS WORKLOAD:
HIVE VS PDW
Performance Analysis (cont.)
PDW is faster than Hive for all TPC-H queries
The average speedup of PDW over Hive is greater for small datasets
Hive has high overheads for small datasets.
Scalability Analysis
Hive scales better than PDW
Hive scales well as the dataset size increases.
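The two notions used above, speedup and scalability, can be written down as simple ratios. This is one possible reading, assuming that speedup means Hive time divided by PDW time for the same query, and that a system "scales well" when its query time grows by less than the factor by which the data grows. The function names are hypothetical.

```python
def speedup(hive_seconds: float, pdw_seconds: float) -> float:
    """How many times faster PDW is than Hive on the same query."""
    return hive_seconds / pdw_seconds


def scaling_factor(time_small: float, time_large: float) -> float:
    """Factor by which query time grows between two dataset sizes."""
    return time_large / time_small


def scales_sublinearly(time_small: float, time_large: float,
                       data_growth: float) -> bool:
    """True if time grows by less than the data growth factor."""
    return scaling_factor(time_small, time_large) < data_growth
```

Under this reading, a fixed startup overhead explains both observations: it inflates the speedup on small datasets and makes the time growth look sublinear as the data gets larger.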
12. MODERN OLTP WORKLOAD:
MONGODB VS SQL SERVER
Workload description
Extended YCSB in 2 ways:
added support for multiple instances on many database servers
added support for stored procedures in the YCSB JDBC driver
Ran the YCSB benchmark on a database that consists of 640 million records
13. MODERN OLTP WORKLOAD:
MONGODB VS SQL SERVER
Data Preparation
Mongo-AS can automatically manage the shards by using a
“balancer” process
The loading time for SQL-CS and Mongo-CS was 146 and 45
minutes respectively
The SQL load time is longer because a bulk insert method was not
used
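The effect of the loading method can be illustrated with a small sketch that contrasts row-at-a-time inserts with batched inserts. It uses Python's built-in `sqlite3` module as a stand-in, so it shows plain batching rather than SQL Server's bulk insert facility. The table layout and the function names are assumptions made for illustration.

```python
import sqlite3

INSERT_SQL = "INSERT INTO usertable (ycsb_key, field0) VALUES (?, ?)"


def make_table() -> sqlite3.Connection:
    """Create an in-memory table loosely modelled on a YCSB record."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE usertable (ycsb_key TEXT PRIMARY KEY, field0 TEXT)")
    return conn


def load_row_at_a_time(conn: sqlite3.Connection, rows: list) -> None:
    """One statement and one commit per record."""
    for row in rows:
        conn.execute(INSERT_SQL, row)
        conn.commit()


def load_in_batches(conn: sqlite3.Connection, rows: list,
                    batch_size: int = 1000) -> None:
    """Group records so each commit covers a whole batch."""
    for start in range(0, len(rows), batch_size):
        conn.executemany(INSERT_SQL, rows[start:start + batch_size])
        conn.commit()
```

Both functions leave the table with the same contents. The difference is the number of statements and commits, which is the per-record overhead that a bulk load path avoids.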
18. DISCUSSION & CONCLUSION
This evaluation shows that NoSQL systems are still behind RDBMS in
performance.
PDW is 9 times faster than Hive when running TPC-H at the 16 TB scale
SQL-CS achieved higher throughput than MongoDB
19. AUTHORS
Avrilia Floratou
University of Wisconsin-Madison
Nikhil Teletia
Microsoft Jim Gray Systems Lab
David J. DeWitt
Microsoft Jim Gray Systems Lab
Jignesh M. Patel
University of Wisconsin-Madison
Donghui Zhang
Paradigm4