A new methodology for large scale benchmarking
              A step by step methodology




                     Dory Thibault

                         UCL

      Contact : thibault.dory@student.uclouvain.be


                  Sponsor : Euranova


           Website : nosqlbenchmarking.com




                    March 1, 2011
Wikipedia infrastructure
       The benchmark VS the real Wikipedia load
                     The updated methodology

Existing Wikipedia infrastructure




  The structured data (revision history, article relations, user
  accounts...) are stored in MySQL
       Each wiki has its own database, but not necessarily its own cluster
       Each cluster is made of several MySQL servers using
       replication
       Only one master for each cluster
             All the writes are handled by the master
             The multiple slaves serve the reads
       Currently there are 37 servers running MySQL according to
       ganglia.wikimedia.org
       Each one has
             between 8 and 12 CPUs running at 2.2 GHz
             between 32 and 64 GB of RAM
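
  The single-master layout described above can be illustrated with a toy
  router. This is only a sketch: the class name, the server names and the
  round-robin choice of slave are assumptions made for illustration, not a
  description of how Wikipedia actually balances its reads.

```python
import itertools


class ReplicatedCluster:
    """Toy model of one master plus several read-only slaves (hypothetical)."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        # Round-robin over the slaves is an assumption of this sketch.
        self._next_slave = itertools.cycle(self.slaves)

    def route(self, statement):
        """Return the name of the server that should run the statement."""
        is_read = statement.lstrip().upper().startswith("SELECT")
        if is_read and self.slaves:
            return next(self._next_slave)
        # Every write (and every read when no slave exists) goes to the master.
        return self.master


cluster = ReplicatedCluster("db-master", ["db-slave-1", "db-slave-2"])
```

  Because every write lands on the one master, write capacity does not grow
  when slaves are added; only read capacity does.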






  The content of the latest version of an article is stored as a blob on
  external storage servers
      Replicated cluster of 3 MySQL hosts
      This data is stored apart from the main core databases
      because this content :
             Needs a lot of storage space
             Is rarely accessed, thanks to the cache servers





The benchmark VS the real Wikipedia load
  A very simplified model
  The benchmark does not try to reproduce the real load on the
  MySQL clusters

      There is no computational work on the structured data
      There is no other cache than the one provided by the
      database itself
      The MySQL clusters run on a few powerful servers while the
      NoSQL clusters will run on many small servers
  So why Wikipedia?
  The main point in using Wikipedia's data is to use real data : each
  entry has a different size and the MapReduce computation on the
  content makes sense.

The new data set

  All the articles from Wikipedia in English
  The new data set is made of all the 10+ million articles from the
  English version of Wikipedia

       Sums up to 28 GB uncompressed
       Each article is treated as an XML blob with all its metadata
       and is identified by a unique integer ID
  Is that enough data?
  Not really for a very big cluster. The solution is simply to insert the
  same data set several times, still using a unique ID for each insert.
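
  One possible way to keep IDs unique across repeated inserts is to shift
  each pass by a fixed offset. The sketch below is an assumption about how
  this could be done, not the scheme actually used by the benchmark; the
  function name and the sample IDs are hypothetical.

```python
def replicated_ids(article_ids, copies):
    """Yield (new_id, original_id) pairs for `copies` passes over the data set."""
    ids = list(article_ids)
    if not ids:
        return
    # An offset larger than any original ID guarantees passes never collide.
    stride = max(ids) + 1
    for copy_index in range(copies):
        for original_id in ids:
            yield copy_index * stride + original_id, original_id


# Three passes over a tiny, made-up set of article IDs.
pairs = list(replicated_ids([1, 2, 5], copies=3))
```

  The original ID stays recoverable as new_id modulo the stride, so the same
  article content can be looked up for every copy.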




The old benchmark architecture




  Scaling problem
  This architecture does not scale, mainly for bandwidth reasons. The
  computational power needed is small, but the whole article is
  transmitted for each request.
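
  A rough back-of-the-envelope calculation shows why bandwidth dominates.
  It uses the data set figures given earlier (28 GB for about 10 million
  articles) and assumes decimal gigabytes; the request rate in the usage
  note is purely hypothetical.

```python
DATASET_BYTES = 28 * 10**9   # 28 GB uncompressed, read as decimal gigabytes
ARTICLE_COUNT = 10 * 10**6   # "10+ million" articles, rounded down


def average_article_bytes():
    """Approximate size of one article in bytes."""
    return DATASET_BYTES / ARTICLE_COUNT


def required_bandwidth_mbit(requests_per_second):
    """Megabits per second needed if every request ships one average article."""
    return requests_per_second * average_article_bytes() * 8 / 10**6
```

  With these numbers an average article weighs about 2.8 KB, so a
  hypothetical load of 10,000 requests per second would already need about
  224 Mbit/s through a single benchmark client.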




The distributed benchmark architecture





The new infrastructure


  Amazon EC2 infrastructure
  I plan to use mainly small standard instances (1 CPU, 1.7 GB of
  RAM) on the Amazon EC2 infrastructure.

  The biggest cluster should be made of :
      Hundreds of small EC2 instances
      A few bigger servers for systems that rely on a master or a load
      balancer, such as HBase





The measured properties


   1   Raw performance : how long does it take to complete all the
       requests?
   2   Scalability : what is the impact on performance of
       changing the cluster size (number of nodes and data set)?
   3   Elasticity : how long does it take to reach a stable state
       with increased performance when nodes are added to the
       cluster?
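
  The first property can be measured with a very small timing harness. The
  sketch below is illustrative only: send_request stands for whatever client
  call the benchmarked database needs, and the injectable clock is an
  assumption added to make the function easy to test.

```python
import time


def run_requests(send_request, request_ids, clock=time.perf_counter):
    """Time one full pass; return (elapsed_seconds, requests_per_second)."""
    request_ids = list(request_ids)
    start = clock()
    for request_id in request_ids:
        send_request(request_id)
    elapsed = clock() - start
    # Guard against a zero reading on very short runs.
    throughput = len(request_ids) / elapsed if elapsed > 0 else float("inf")
    return elapsed, throughput
```

  Running the same pass against clusters of different sizes and comparing
  the elapsed times gives a first view of the second property, scalability.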





Measuring the elasticity
  The most complex of the three measures
  The time needed for the system to stabilize should be different for
  each system and for each cluster size. I have chosen to characterize
  the elasticity by computing the standard deviation over smaller
  benchmark runs.
    1   Use a stable cluster to determine the usual standard deviation
        of the DB
    2   Add the new nodes to the cluster but do not increase the data
        set
    3   Repeat :
              Start a benchmark run and compute the standard deviation
              Wait X seconds
    4   Until the standard deviations of the last Y runs do not
        deviate by more than Z percent from the usual standard
        deviation
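
  The procedure above can be sketched as a loop. Several details are
  assumptions, since the slides do not fix them: the standard deviation is
  taken over the request latencies of each run, the population form
  (statistics.pstdev) is used, and max_runs is a safety limit added here.
  All function and parameter names are hypothetical.

```python
import statistics
import time


def within_tolerance(value, baseline, z_percent):
    """True if value differs from baseline by at most z_percent of baseline."""
    return abs(value - baseline) <= baseline * z_percent / 100.0


def measure_elasticity(run_benchmark, baseline_std, x_seconds, y_runs,
                       z_percent, max_runs=1000, sleep=time.sleep):
    """Return how many runs were needed to stabilize, or None if never."""
    recent = []
    for run_index in range(1, max_runs + 1):
        latencies = run_benchmark()          # step 3: one small benchmark run
        recent.append(statistics.pstdev(latencies))
        recent = recent[-y_runs:]            # keep only the last Y runs
        # Step 4: stop once the last Y deviations are all close to the usual one.
        if len(recent) == y_runs and all(
            within_tolerance(s, baseline_std, z_percent) for s in recent
        ):
            return run_index
        sleep(x_seconds)                     # step 3: wait X seconds
    return None
```

  Multiplying the returned run count by the run duration plus X gives an
  estimate of the stabilization time for one system and one cluster size.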
                                                                         11 / 13

More Related Content

What's hot

First review presentation
First review presentationFirst review presentation
First review presentationArvind Krishnaa
 
Boot Strapping in Cassandra
Boot Strapping  in CassandraBoot Strapping  in Cassandra
Boot Strapping in CassandraArunit Gupta
 
Distribute Key Value Store
Distribute Key Value StoreDistribute Key Value Store
Distribute Key Value StoreSantal Li
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLRamakant Soni
 
Distributed, concurrent, and independent access to encrypted cloud databases
Distributed, concurrent, and independent access to encrypted cloud databasesDistributed, concurrent, and independent access to encrypted cloud databases
Distributed, concurrent, and independent access to encrypted cloud databasesPapitha Velumani
 
Apache Cassandra @Geneva JUG 2013.02.26
Apache Cassandra @Geneva JUG 2013.02.26Apache Cassandra @Geneva JUG 2013.02.26
Apache Cassandra @Geneva JUG 2013.02.26Benoit Perroud
 
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010CLOUDIAN KK
 
distributed, concurrent, and independent access to encrypted cloud databases
distributed, concurrent, and independent access to encrypted cloud databasesdistributed, concurrent, and independent access to encrypted cloud databases
distributed, concurrent, and independent access to encrypted cloud databasesswathi78
 
Ycsb benchmarking
Ycsb benchmarkingYcsb benchmarking
Ycsb benchmarkingSqrrl
 
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Distributed, concurrent, and independ...
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Distributed, concurrent, and independ...IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Distributed, concurrent, and independ...
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Distributed, concurrent, and independ...IEEEGLOBALSOFTSTUDENTPROJECTS
 
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed DatabaseEric Evans
 
Cassandra - A decentralized storage system
Cassandra - A decentralized storage systemCassandra - A decentralized storage system
Cassandra - A decentralized storage systemArunit Gupta
 
Strata SC 2014: Apache Mesos as an SDK for Building Distributed Frameworks
Strata SC 2014: Apache Mesos as an SDK for Building Distributed FrameworksStrata SC 2014: Apache Mesos as an SDK for Building Distributed Frameworks
Strata SC 2014: Apache Mesos as an SDK for Building Distributed FrameworksPaco Nathan
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internalsnarsiman
 
Cluster Computing Seminar.
Cluster Computing Seminar.Cluster Computing Seminar.
Cluster Computing Seminar.Balvant Biradar
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic WebIrina Hutanu
 

What's hot (20)

First review presentation
First review presentationFirst review presentation
First review presentation
 
Cassandra useful features
Cassandra useful featuresCassandra useful features
Cassandra useful features
 
Boot Strapping in Cassandra
Boot Strapping  in CassandraBoot Strapping  in Cassandra
Boot Strapping in Cassandra
 
Distribute Key Value Store
Distribute Key Value StoreDistribute Key Value Store
Distribute Key Value Store
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
 
Distributed, concurrent, and independent access to encrypted cloud databases
Distributed, concurrent, and independent access to encrypted cloud databasesDistributed, concurrent, and independent access to encrypted cloud databases
Distributed, concurrent, and independent access to encrypted cloud databases
 
Apache Cassandra @Geneva JUG 2013.02.26
Apache Cassandra @Geneva JUG 2013.02.26Apache Cassandra @Geneva JUG 2013.02.26
Apache Cassandra @Geneva JUG 2013.02.26
 
Clustering van IT-componenten
Clustering van IT-componentenClustering van IT-componenten
Clustering van IT-componenten
 
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
 
The Google Bigtable
The Google BigtableThe Google Bigtable
The Google Bigtable
 
distributed, concurrent, and independent access to encrypted cloud databases
distributed, concurrent, and independent access to encrypted cloud databasesdistributed, concurrent, and independent access to encrypted cloud databases
distributed, concurrent, and independent access to encrypted cloud databases
 
Apache cassandra
Apache cassandraApache cassandra
Apache cassandra
 
Ycsb benchmarking
Ycsb benchmarkingYcsb benchmarking
Ycsb benchmarking
 
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Distributed, concurrent, and independ...
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Distributed, concurrent, and independ...IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Distributed, concurrent, and independ...
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Distributed, concurrent, and independ...
 
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed Database
 
Cassandra - A decentralized storage system
Cassandra - A decentralized storage systemCassandra - A decentralized storage system
Cassandra - A decentralized storage system
 
Strata SC 2014: Apache Mesos as an SDK for Building Distributed Frameworks
Strata SC 2014: Apache Mesos as an SDK for Building Distributed FrameworksStrata SC 2014: Apache Mesos as an SDK for Building Distributed Frameworks
Strata SC 2014: Apache Mesos as an SDK for Building Distributed Frameworks
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internals
 
Cluster Computing Seminar.
Cluster Computing Seminar.Cluster Computing Seminar.
Cluster Computing Seminar.
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic Web
 

Viewers also liked

[150824]symposium v4
[150824]symposium v4[150824]symposium v4
[150824]symposium v4yyooooon
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceEdureka!
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindEMC
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 

Viewers also liked (6)

[150824]symposium v4
[150824]symposium v4[150824]symposium v4
[150824]symposium v4
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map Reduce
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 

Similar to A new methodology for large scale nosql benchmarking

No sql databases
No sql databases No sql databases
No sql databases Ankit Dubey
 
A request skew aware heterogeneous distributed
A request skew aware heterogeneous distributedA request skew aware heterogeneous distributed
A request skew aware heterogeneous distributedJoão Gabriel Lima
 
Cache and consistency in nosql
Cache and consistency in nosqlCache and consistency in nosql
Cache and consistency in nosqlJoão Gabriel Lima
 
NoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsNoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsFirat Atagun
 
CNR @ VMUG.IT 20150304
CNR @ VMUG.IT 20150304CNR @ VMUG.IT 20150304
CNR @ VMUG.IT 20150304VMUG IT
 
Presentation on Databases in the Cloud
Presentation on Databases in the CloudPresentation on Databases in the Cloud
Presentation on Databases in the Cloudmoshfiq
 
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData Inc
 
A novel solution of distributed memory no sql database for cloud computing
A novel solution of distributed memory no sql database for cloud computingA novel solution of distributed memory no sql database for cloud computing
A novel solution of distributed memory no sql database for cloud computingJoão Gabriel Lima
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSSteve Wong
 
What is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of databaseWhat is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of databaseAlireza Kamrani
 
Nosql availability & integrity
Nosql availability & integrityNosql availability & integrity
Nosql availability & integrityFahri Firdausillah
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterIOSR Journals
 
Comparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsbComparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsbsonalighai
 
Save 60% of Kubernetes storage costs on AWS & others with OpenEBS
Save 60% of Kubernetes storage costs on AWS & others with OpenEBSSave 60% of Kubernetes storage costs on AWS & others with OpenEBS
Save 60% of Kubernetes storage costs on AWS & others with OpenEBSMayaData Inc
 
A Tour of Azure SQL Databases (NOVA SQL UG 2020)
A Tour of Azure SQL Databases  (NOVA SQL UG 2020)A Tour of Azure SQL Databases  (NOVA SQL UG 2020)
A Tour of Azure SQL Databases (NOVA SQL UG 2020)Timothy McAliley
 
Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...ijdms
 
Liquid: A Scalable Deduplication File System for Virtual Machine Images
Liquid: A Scalable Deduplication File System for Virtual Machine Images Liquid: A Scalable Deduplication File System for Virtual Machine Images
Liquid: A Scalable Deduplication File System for Virtual Machine Images Anamika Vinod
 

Similar to A new methodology for large scale nosql benchmarking (20)

No sql databases
No sql databases No sql databases
No sql databases
 
A request skew aware heterogeneous distributed
A request skew aware heterogeneous distributedA request skew aware heterogeneous distributed
A request skew aware heterogeneous distributed
 
Cache and consistency in nosql
Cache and consistency in nosqlCache and consistency in nosql
Cache and consistency in nosql
 
NoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsNoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, Implementations
 
Oracle Coherence
Oracle CoherenceOracle Coherence
Oracle Coherence
 
CNR @ VMUG.IT 20150304
CNR @ VMUG.IT 20150304CNR @ VMUG.IT 20150304
CNR @ VMUG.IT 20150304
 
No sq lv1_0
No sq lv1_0No sq lv1_0
No sq lv1_0
 
Presentation on Databases in the Cloud
Presentation on Databases in the CloudPresentation on Databases in the Cloud
Presentation on Databases in the Cloud
 
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
 
A novel solution of distributed memory no sql database for cloud computing
A novel solution of distributed memory no sql database for cloud computingA novel solution of distributed memory no sql database for cloud computing
A novel solution of distributed memory no sql database for cloud computing
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OS
 
What is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of databaseWhat is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of database
 
Nosql availability & integrity
Nosql availability & integrityNosql availability & integrity
Nosql availability & integrity
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
 
Comparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsbComparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsb
 
Save 60% of Kubernetes storage costs on AWS & others with OpenEBS
Save 60% of Kubernetes storage costs on AWS & others with OpenEBSSave 60% of Kubernetes storage costs on AWS & others with OpenEBS
Save 60% of Kubernetes storage costs on AWS & others with OpenEBS
 
A Tour of Azure SQL Databases (NOVA SQL UG 2020)
A Tour of Azure SQL Databases  (NOVA SQL UG 2020)A Tour of Azure SQL Databases  (NOVA SQL UG 2020)
A Tour of Azure SQL Databases (NOVA SQL UG 2020)
 
Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...
 
Liquid: A Scalable Deduplication File System for Virtual Machine Images
Liquid: A Scalable Deduplication File System for Virtual Machine Images Liquid: A Scalable Deduplication File System for Virtual Machine Images
Liquid: A Scalable Deduplication File System for Virtual Machine Images
 
Datastores
DatastoresDatastores
Datastores
 

Recently uploaded

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Recently uploaded (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

A new methodology for large scale nosql benchmarking

  • 1. A new methodology for large scale benchmarking A step by step methodology Dory Thibault UCL Contact : thibault.dory@student.uclouvain.be Sponsor : Euranova Website : nosqlbenchmarking.com March 1, 2011
  • 2. Wikipedia infrastructure The benchmark VS the real Wikipedia load The updated methodology Existing Wikipedia infrastructure 2 / 13
  • 3. Existing Wikipedia infrastructure
       The structured data (revision history, article relations, user accounts...) is stored in MySQL.
         • Each wiki has its own database, not necessarily its own cluster
         • Each cluster is made of several MySQL servers using replication
         • Only one master for each cluster
             • All the writes are handled by the master
             • The multiple slaves serve the reads
         • Currently there are 37 servers running MySQL according to ganglia.wikimedia.org
         • Each one has
             • between 8 and 12 CPUs running at 2.2 GHz
             • between 32 and 64 GB of RAM
  • 4. Existing Wikipedia infrastructure
       The content of the latest version of an article is stored as a blob on external storage servers.
         • Replicated cluster of 3 MySQL hosts
         • This data is stored apart from the main core databases because this content:
             • Needs a lot of storage space
             • Is largely unused thanks to the cache servers
  • 5. The benchmark VS the real Wikipedia load
       A very simplified model
         • The benchmark does not try to reproduce the real load on the MySQL clusters
         • There is no computational work on the structured data
         • There is no other cache than the one provided by the database itself
         • The MySQL clusters run on a few powerful servers while the NoSQL clusters will run on many small servers
       So why Wikipedia?
         The main point in using Wikipedia's data is to use real data: each entry has a different size and the MapReduce computation on the content makes sense.
  • 7. The new data set
       All the articles from Wikipedia in English
         • The new data set is made of all the articles from the English version of Wikipedia, more than 10 million in total
         • Sums up to 28 GB uncompressed
         • Each article is considered as an XML blob with all its metadata and is identified with a unique integer ID
       Is that enough data?
         Not really for a very big cluster. The solution is simply to insert the same data set several times, still using a unique ID for each insert.
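One way to keep IDs unique while inserting the same data set several times is to shift the original IDs by a fixed offset on each pass. The sketch below is only an illustration of that idea: the function name `replicated_ids` and the `id_space` parameter are hypothetical, and the slides do not say how the unique IDs were actually produced.

```python
from typing import Iterable, Iterator, Tuple


def replicated_ids(
    article_ids: Iterable[int], copies: int, id_space: int
) -> Iterator[Tuple[int, int]]:
    """Yield (new_id, original_id) pairs for `copies` passes over the data set.

    Assumption: original IDs are non-negative integers and `id_space` is
    strictly greater than the largest one, so passes can never collide.
    """
    ids = list(article_ids)
    if ids and (min(ids) < 0 or max(ids) >= id_space):
        raise ValueError("IDs must satisfy 0 <= id < id_space")
    for pass_index in range(copies):
        offset = pass_index * id_space  # each pass gets its own ID range
        for original_id in ids:
            yield offset + original_id, original_id
```

With `id_space=10` and two passes, article 3 is inserted once as ID 3 and once as ID 13, so the original article can still be recovered as `new_id % id_space`.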
  • 9. The old benchmark architecture
       Scaling problem
         This architecture does not scale, mainly for bandwidth reasons. The computational power needed is small but the whole article is transmitted for each request.
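A rough back-of-the-envelope calculation shows why bandwidth becomes the limit when every request carries a whole article. The data set size and article count come from the slides (28 GB, more than 10 million articles, so the average below is an upper bound); the request rate is a purely hypothetical figure and GB is assumed to mean 10^9 bytes.

```python
DATASET_BYTES = 28 * 10**9   # 28 GB uncompressed, taken from the slides
ARTICLE_COUNT = 10 * 10**6   # "more than 10 million": used as a lower bound


def average_article_bytes(dataset_bytes=DATASET_BYTES, article_count=ARTICLE_COUNT):
    """Average article size in bytes (an upper bound, since the count is a lower bound)."""
    return dataset_bytes / article_count


def required_bandwidth_mbit(requests_per_second, article_bytes):
    """Bandwidth in Mbit/s when every request transfers one full article."""
    return requests_per_second * article_bytes * 8 / 10**6


avg = average_article_bytes()                     # 2800.0 bytes per article
print(required_bandwidth_mbit(10_000, avg))       # hypothetical 10,000 requests/s
```

Under these assumptions a single machine issuing 10,000 requests per second would need about 224 Mbit/s, which is why a single benchmark client saturates its link long before it runs out of CPU.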
  • 10. The distributed benchmark architecture
  • 11. The new infrastructure
       Amazon EC2 infrastructure
         I plan to use mainly small standard instances (1 CPU, 1.7 GB of RAM) on the Amazon EC2 infrastructure. The biggest cluster should be made of:
         • Hundreds of small EC2 instances
         • A few bigger servers for systems that use a master or a load balancer, like HBase
  • 12. The measured properties
         1. The raw performance: how fast is it to make all the requests?
         2. The scalability: what is the impact on performance of changing the cluster size (number of nodes and data set)?
         3. The elasticity: how long does it take to reach a stable state with increased performance when nodes are added to the cluster?
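The first two properties can be summarized with simple ratios. The sketch below is one possible way to do so, not a definition taken from the slides: `throughput` and `scaling_efficiency` are hypothetical helper names, and an efficiency of 1.0 is assumed to mean that throughput grew in proportion to the number of nodes.

```python
def throughput(total_requests: int, elapsed_seconds: float) -> float:
    """Raw performance expressed as requests served per second."""
    return total_requests / elapsed_seconds


def scaling_efficiency(
    base_throughput: float, base_nodes: int,
    scaled_throughput: float, scaled_nodes: int,
) -> float:
    """Observed speedup divided by the ideal (linear) speedup."""
    speedup = scaled_throughput / base_throughput
    ideal = scaled_nodes / base_nodes
    return speedup / ideal
```

For example, if doubling a cluster from 50 to 100 nodes raises throughput from 500 to 900 requests per second, the efficiency is 0.9.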
  • 13. Measuring the elasticity
       The most complex of the three measures
         The time needed for the system to stabilize should be different for each system and for each cluster size. I have chosen to characterize the elasticity by computing the standard deviation for smaller benchmark runs.
         1. Use a stable cluster to determine the usual standard deviation of the DB
         2. Add the new nodes to the cluster but do not increase the data set
         3. Repeat:
              • Start a benchmark run and compute the standard deviation
              • Wait X seconds
         4. Until the standard deviation for the last Y runs does not diverge more than Z percent from the usual standard deviation
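The repeat-until loop of the elasticity measurement can be sketched as follows. Several things here are assumptions rather than details from the slides: the standard deviation is taken over per-request latencies returned by a hypothetical `run_benchmark` callable, the population standard deviation (`statistics.pstdev`) is used, and `max_runs` is an added safety limit. `x_seconds`, `y_runs` and `z_percent` correspond to X, Y and Z.

```python
import statistics
import time
from typing import Callable, List, Sequence


def is_stable(recent_stdevs: Sequence[float], usual_stdev: float, z_percent: float) -> bool:
    """True if every recent standard deviation is within z_percent of the usual one."""
    tolerance = usual_stdev * z_percent / 100.0
    return all(abs(s - usual_stdev) <= tolerance for s in recent_stdevs)


def measure_elasticity(
    run_benchmark: Callable[[], Sequence[float]],  # hypothetical: returns latencies
    usual_stdev: float,
    x_seconds: float,
    y_runs: int,
    z_percent: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    max_runs: int = 1000,  # assumption: give up instead of looping forever
) -> float:
    """Return the time elapsed until the last y_runs runs look stable."""
    start = clock()
    stdevs: List[float] = []
    for _ in range(max_runs):
        stdevs.append(statistics.pstdev(run_benchmark()))  # one small benchmark run
        sleep(x_seconds)                                   # wait X seconds
        if len(stdevs) >= y_runs and is_stable(stdevs[-y_runs:], usual_stdev, z_percent):
            return clock() - start
    raise RuntimeError("cluster did not stabilize within max_runs runs")
```

Passing `sleep` and `clock` as parameters is a design choice that lets the loop be exercised with fake timers instead of waiting in real time.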
  • 14. The step by step methodology
         1. Start up a clean cluster of size 50 and insert all the articles
         2. Measure the standard deviation for this cluster once it has stabilized
         3. Choose a total number of requests and a read-only percentage
         4. Start the benchmark with the chosen number of requests and read-only percentage
         5. Start the MapReduce benchmark
         6. Double the number of nodes in the cluster
         7. Start the elasticity test
         8. Double the size of the data set inserted
         9. Jump to 4 with a doubled number of requests until there are no more servers to add to the cluster
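The doubling schedule implied by steps 4 to 9 can be written out as a small generator. This is only a sketch of the schedule, not of the benchmark itself: the `Round` class and the `max_nodes` parameter (standing in for "no more servers to add") are hypothetical names, and the data set size is counted in copies of the original Wikipedia dump.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Round:
    nodes: int        # cluster size when the benchmark runs (step 4)
    data_copies: int  # how many times the data set has been inserted
    requests: int     # total number of requests for this round


def benchmark_rounds(initial_nodes: int, initial_requests: int, max_nodes: int) -> Iterator[Round]:
    """Yield one Round per pass through steps 4 to 9 of the methodology."""
    nodes, copies, requests = initial_nodes, 1, initial_requests
    while nodes <= max_nodes:
        yield Round(nodes, copies, requests)
        nodes *= 2      # step 6: double the number of nodes
        copies *= 2     # step 8: double the data set
        requests *= 2   # step 9: double the number of requests


for r in benchmark_rounds(initial_nodes=50, initial_requests=1000, max_nodes=400):
    print(r)
```

Starting from 50 nodes with a limit of 400, the benchmark runs on clusters of 50, 100, 200 and 400 nodes, with the data set and request count doubling alongside.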
  • 15. Bibliography
         • www.nedworks.org/mark/presentations/san/Wikimedia%20architecture.pdf
         • http://meta.wikimedia.org/wiki/Wikimedia_servers
         • http://ganglia.wikimedia.org/