A new methodology for large scale nosql benchmarking


A presentation of the new methodology I plan to use for large scale benchmarking of various NoSQL databases. Preceded by a short comparison with the current Wikipedia infrastructure.

A new methodology for large scale benchmarking
A step by step methodology

Dory Thibault

UCL

Contact : thibault.dory@student.uclouvain.be

Sponsor : Euranova

Website : nosqlbenchmarking.com

March 1, 2011

Wikipedia infrastructure
The benchmark VS the real Wikipedia load
The updated methodology

Existing Wikipedia infrastructure

Structured data (revision history, article relations, user accounts...) is stored in MySQL
Each wiki has its own database, but not necessarily its own cluster
Each cluster is made of several MySQL servers using replication
Only one master for each cluster
All the writes are handled by the master
The multiple slaves serve the reads
Currently there are 37 servers running MySQL according to ganglia.wikimedia.org
Each one has between 8 and 12 CPUs running at 2.2 GHz and between 32 and 64 GB of RAM


The content of the last version of an article is stored as a blob on external storage servers
Replicated cluster of 3 MySQL hosts
This data is stored apart from the main core databases because this content:
Needs a lot of storage space
Is rarely accessed thanks to the cache servers


The benchmark VS the real Wikipedia load
A very simplified model
The benchmark does not try to reproduce the real load on the MySQL clusters
There is no computational work on the structured data
There is no cache other than the one provided by the database itself
The MySQL clusters run on a few powerful servers while the NoSQL clusters will run on many small servers
So why Wikipedia?
The main point of using Wikipedia's data is to use real data: each entry has a different size and the MapReduce computation on the content makes sense.

The new data set

All the articles from Wikipedia in English
The new data set is made of all the 10+ million articles from the English version of Wikipedia
Sums up to 28 GB uncompressed
Each article is considered as an XML blob with all its metadata and is identified with a unique integer ID
Is that enough data?
Not really for a very big cluster. The solution is simply to insert the same data set several times, still using a unique ID for each insert.
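
One way to keep IDs unique across repeated inserts is to shift each pass by the size of the base data set. The sketch below is only an illustration: the function name, the `(id, blob)` tuple shape, and the assumption that the original IDs fall in the range 0 to `base_count - 1` are not taken from the slides.

```python
from typing import Iterator, Sequence, Tuple


def replicated_entries(
    articles: Sequence[Tuple[int, str]],
    base_count: int,
    copies: int,
) -> Iterator[Tuple[int, str]]:
    """Yield (unique_id, xml_blob) pairs for `copies` passes over the data set.

    Assumption: every original article ID is an integer in [0, base_count),
    so adding copy_index * base_count can never collide with another pass.
    """
    for copy_index in range(copies):
        offset = copy_index * base_count
        for article_id, blob in articles:
            yield offset + article_id, blob


# Two tiny stand-in articles inserted three times give six distinct IDs.
sample = [(0, "<page>A</page>"), (1, "<page>B</page>")]
entries = list(replicated_entries(sample, base_count=2, copies=3))
```

In a real run the articles would be streamed from the dump on each pass rather than held in a sequence in memory.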


The old benchmark architecture

Scaling problem

This architecture does not scale, mainly for bandwidth reasons. The computational power needed is small, but the whole article is transmitted for each request.
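
A rough back-of-the-envelope estimate shows why bandwidth becomes the limit. The data set figures (28 GB, about 10 million articles) come from the data set slide; the request rate of 10,000 requests per second is a purely hypothetical value chosen for illustration, and GB is read as 10^9 bytes.

```python
def required_bandwidth_mbit(
    dataset_bytes: float,
    article_count: float,
    requests_per_second: float,
) -> float:
    """Estimate the bandwidth needed if every request transfers one whole article."""
    average_article_bytes = dataset_bytes / article_count
    bytes_per_second = average_article_bytes * requests_per_second
    return bytes_per_second * 8 / 1e6  # bytes/s -> Mbit/s


# 28 GB over 10 million articles is about 2.8 KB per article on average.
# The request rate below is an assumption, not a measured value.
estimate = required_bandwidth_mbit(28e9, 10e6, requests_per_second=10_000)
```

Under these assumptions the estimate comes to roughly 224 Mbit/s through a single benchmark client, which is the kind of load that distributing the clients is meant to avoid.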


The distributed benchmark architecture


The new infrastructure

Amazon EC2 infrastructure
I plan to use mainly small standard instances (1 CPU, 1.7 GB of RAM) on the Amazon EC2 infrastructure.
The biggest cluster should be made of:
Hundreds of small EC2 instances
A few bigger servers for systems that use a master or a load balancer, like HBase


The measured properties

1. Raw performance: how fast is it to make all the requests?
2. Scalability: what is the impact on performance of changing the cluster size (number of nodes and data set)?
3. Elasticity: how long does it take to reach a stable state with increased performance when nodes are added to the cluster?


Measuring the elasticity
The most complex of the three measures
The time needed for the system to stabilize should be different for each system and for each cluster size. I have chosen to characterize the elasticity by computing the standard deviation for smaller benchmark runs.
1. Use a stable cluster to determine the usual standard deviation of the DB
2. Add the new nodes to the cluster but do not increase the data set
3. Repeat:
   Start a benchmark run and compute the standard deviation
   Wait X seconds
4. Until the standard deviation for the last Y runs does not diverge by more than Z percent from the usual standard deviation
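
The repeat-until loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: `run_benchmark` is a hypothetical callable returning the response times of one short run, the standard deviation is taken over those response times, and "does not diverge" is read as every one of the last Y runs staying within Z percent of the baseline. The `max_runs` guard is an addition to avoid looping forever.

```python
import statistics
import time
from typing import Callable, List


def runs_until_stable(
    run_benchmark: Callable[[], List[float]],
    baseline_stddev: float,
    wait_seconds: float,        # X in the slides
    window: int,                # Y in the slides
    tolerance_percent: float,   # Z in the slides
    max_runs: int = 1000,       # safety guard, not part of the slides
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return how many short runs were needed before the cluster looked stable."""
    allowed = baseline_stddev * tolerance_percent / 100.0
    recent: List[float] = []
    for run_number in range(1, max_runs + 1):
        response_times = run_benchmark()
        recent.append(statistics.pstdev(response_times))
        recent = recent[-window:]  # keep only the last Y deviations
        if len(recent) == window and all(
            abs(stddev - baseline_stddev) <= allowed for stddev in recent
        ):
            return run_number
        sleep(wait_seconds)
    raise RuntimeError("cluster did not stabilize within max_runs")
```

Timing the call from the moment the nodes are added gives the elasticity figure; passing a no-op `sleep` makes the loop easy to exercise with canned data.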

The step by step methodology
1. Start up a clean cluster of size 50 and insert all the articles
2. Measure the standard deviation for this cluster once it has stabilized
3. Choose a total number of requests and a read-only percentage
4. Start the benchmark with the chosen number of requests and read-only percentage
5. Start the MapReduce benchmark
6. Double the number of nodes in the cluster
7. Start the elasticity test
8. Double the size of the data set inserted
9. Jump to 4 with a doubled number of requests until there are no more servers to add to the cluster
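
The nine steps above can be sketched as a driver loop. Everything named here is hypothetical: the `Cluster` interface, its method names, and the `Round` record are invented for illustration, and the sketch assumes the baseline standard deviation from step 2 is reused for every elasticity test, which the slides do not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


class Cluster(Protocol):
    """Hypothetical driver interface; these method names are not from the slides."""

    def resize(self, nodes: int) -> None: ...
    def insert_data_set(self, copies: int) -> None: ...
    def measure_baseline_stddev(self) -> float: ...
    def run_requests(self, total: int, read_only_percent: int) -> float: ...
    def run_mapreduce(self) -> float: ...
    def run_elasticity_test(self, baseline_stddev: float) -> float: ...


@dataclass
class Round:
    nodes: int
    data_set_copies: int
    requests: int
    request_seconds: float
    mapreduce_seconds: float
    # Time to stabilize after growing the cluster; None for the final round.
    elasticity_seconds: Optional[float] = None


def run_methodology(
    cluster: Cluster,
    max_nodes: int,
    initial_requests: int,       # step 3: chosen by the experimenter
    read_only_percent: int,      # step 3: chosen by the experimenter
    initial_nodes: int = 50,
) -> List[Round]:
    nodes, copies, requests = initial_nodes, 1, initial_requests
    cluster.resize(nodes)                                    # step 1
    cluster.insert_data_set(copies)
    baseline = cluster.measure_baseline_stddev()             # step 2
    rounds: List[Round] = []
    while True:
        current = Round(
            nodes, copies, requests,
            request_seconds=cluster.run_requests(requests, read_only_percent),  # step 4
            mapreduce_seconds=cluster.run_mapreduce(),                           # step 5
        )
        rounds.append(current)
        if nodes * 2 > max_nodes:                            # no more servers to add
            return rounds
        nodes *= 2
        cluster.resize(nodes)                                # step 6
        current.elasticity_seconds = cluster.run_elasticity_test(baseline)  # step 7
        copies *= 2
        cluster.insert_data_set(copies)                      # step 8
        requests *= 2                                        # step 9
```

With `max_nodes=200` the loop would run three rounds, at 50, 100, and 200 nodes, doubling the request count and the data set between rounds.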

Bibliography
www.nedworks.org/mark/presentations/san/Wikimedia%20architecture.pdf
http://meta.wikimedia.org/wiki/Wikimedia_servers
http://ganglia.wikimedia.org/
