NoSQL overview
Implementation free
  Benoit Perroud
  21. January 2011, 01. March 2011
Disclaimer




     Any views or opinions presented in this presentation are
    solely those of the author and do not necessarily represent
                         those of Verisign.




Outline

    •   Introduction
    •   Scalability and Availability
    •   Difficulties to scale
    •   CAP Theorem
    •   NoSQL Goals
    •   NoSQL Taxonomy
    •   Concepts and Patterns
    •   Existing implementations
    •   NoSQL in the Real World
    •   Conclusion



Introduction




NoSQL Term

    • Mandatory name disambiguation :

          NoSQL stands for Not Only SQL.

    • The term NoSQL is more or less attributed to Eric Evans,
      a Rackspace employee, who used it in early 2009 when
      Johan Oskarsson, a Last.fm employee, wanted to
      organize an event to discuss open-source distributed
      databases.




Wikipedia Definition

    • [Wikipedia] NoSQL is a term used to designate database
      management systems that differ from classic relational
      database management systems (RDBMS) in some way.
      These data stores may not require fixed table schemas,
      usually avoid join operations, do not attempt to provide
      ACID (atomicity, consistency, isolation, durability)
      properties and typically scale horizontally.




Tag Cloud




Why NoSQL ?

    • Is there really a problem with SQL and RDBMS ?
         No, there isn't !
         SQL is powerful, ACID (atomicity, consistency, isolation,
         durability) properties are well-established, and developers
         and DBAs have mastered it.


    • So why the hell do we need NoSQL ?
    • What motivated Google and Amazon to invest so heavily in
      research around NoSQL ?
    • Why are “social sites” so hard to scale ?




Scalability and Availability




Scalability definition

     • [Wikipedia] Scalability is a desirable property of a
       system, a network, or a process, which indicates its
       ability to either handle growing amounts of work in a
       graceful manner or to be readily enlarged.
       In summary : handle load and peaks.

     • Scalability in two dimensions :
        • Scale up → scale vertically (increase RAM in an existing node)
        • Scale out → scale horizontally (add a node to the cluster)




Availability definition

      • [Wikipedia] Availability refers to the ability of the users to access
        and use the system. If a user cannot access the system, it is said
        to be unavailable. Generally, the term downtime is used to refer to
        periods when a system is unavailable.
        In summary : minimize downtime.


     Availability %           Downtime per year   Downtime per month   Downtime per week
     90% ("one nine")         36.5 days           72 hours             16.8 hours
     95%                      18.25 days          36 hours             8.4 hours
     99% ("two nines")        3.65 days           7.20 hours           1.68 hours
     99.9% ("three nines")    8.76 hours          43.2 minutes         10.1 minutes
     99.99% ("four nines")    52.56 minutes       4.32 minutes         1.01 minutes
     99.999% ("five nines")   5.26 minutes        25.9 seconds         6.05 seconds
     99.9999% ("six nines")   31.5 seconds        2.59 seconds         0.605 seconds
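     The figures in the table follow directly from the availability
     percentage. A minimal sketch in Python, assuming a 365-day year, a
     30-day month and a 7-day week (the `downtime` function and the
     `PERIODS` table are illustrative names, not part of the original talk):

```python
from datetime import timedelta

# Assumed period lengths: 365-day year, 30-day month, 7-day week.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def downtime(availability_percent: float, period: str) -> timedelta:
    """Return the downtime allowed in `period` for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return PERIODS[period] * unavailable_fraction


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime(pct, 'year')} per year")
```

     Each additional "nine" divides the allowed downtime by ten, which is
     why the rows of the table shrink so quickly.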




Difficulties to Scale




RDBMS Scalability

     • RDBMS are hard to scale. Hard should be understood
       as costly
          RDBMS licenses, hardware, DBA and operational costs grow
          non-linearly with the load.


       RDBMS have either :
     • a single point of failure (SPOF)
          → the master DB
     • or replication latency
          → distributed transactions : two-phase commit, Paxos algorithm.




Hardware Scalability

     • Commodity hardware and appliances are I/O bound
        • But network throughput is cheaper than hard disk throughput.
     • Hard disks are (were?) the main bottleneck in today's
       computing.
        • Random access has a high latency (disk seek)
        • Throughput has increased, but not in proportion to
          storage capacity


       → Distributing data across a network of small computers
       (and applying the data locality concept) scales better
       (more cheaply) than a huge appliance.



Pioneers in Scaling The Big

     • Google : huge amount of data (hundreds of petabytes, does not
       fit on a single appliance)
        • needs to be partitioned; failures become more frequent as the
          number of machines grows (cluster-wide MTBF shrinks).
          → BigTable.
     • Amazon : high availability (99.999%)
        • needs redundancy and write scalability
          → Dynamo
     • Facebook, Twitter, “social sites”
        •   no natural clusters in the data to partition along,
        •   long tail (old data is still accessed from time to time),
        •   lots of users connected at the same time,
        •   user-specific content (predictions hard to achieve, few static pages)




CAP Theorem




Reminder of the CAP theorem

     • Consistency : all nodes see the same data at the same
       time
     • Availability : node failures do not prevent survivors from
       continuing to operate
     • Partition Tolerance : the system continues to operate
       despite arbitrary message loss

       According to the theorem, a distributed system can
       satisfy any two of these guarantees at the same time,
       but not all three.




NoSQL trade-offs

     • NoSQL datastores have typically made different trade-offs
       than RDBMS with respect to the CAP theorem
        • Most of them gave up the “C” of the theorem, and with it
          the ACID properties.
     • NoSQL datastores also have a simpler data
       access pattern
        • value = get(key)
        • put(key, value)
        • remove(key)
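
     The three operations above can be sketched as a toy in-memory store.
     This is only an illustration of the access pattern: the class name
     `KeyValueStore` is hypothetical, and a real datastore would add
     partitioning, replication and persistence behind the same interface.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory store exposing only get / put / remove."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        # Overwrites any previous value stored under the same key.
        self._data[key] = value

    def remove(self, key: str) -> None:
        # Removing a missing key is a no-op.
        self._data.pop(key, None)
```

     Note what is absent: no joins, no secondary indexes, no multi-key
     transactions. That restriction is what makes the data easy to spread
     across nodes by key.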




NoSQL Goals




NoSQL promises

     • The aim of NoSQL is to achieve (horizontal) scalability and
       high availability.

        • Business goal : Keep cost growing in proportion to the load
          (tight provisioning).
        • Operational goal :
            • Scale the system by simply adding (or removing) nodes.
            • The system runs on commodity hardware.




NoSQL Taxonomy




Taxonomy

       The most common NoSQL types are :

     • Document store
        • stores documents whose structure can be explored.
     • Key/value store
        • simple hash map data access pattern.
     • Column-oriented store
        • something between a simple key/value store and a complex document store
     • Graph database
        • stores nodes and edges; data is accessed by walking the graph
     • Object database




Classification




       Picture from http://blog.nahurst.com/visual-guide-to-nosql-systems
Concepts and Patterns




Implementation side : Consistent Hashing

     • [Wikipedia] Consistent hashing is a scheme that
       provides hash table functionality in a way that the
       addition or removal of one slot does not significantly
       change the mapping of keys to slots.
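
     A minimal sketch of a hash ring in Python, to make the definition
     concrete. The class name `HashRing`, the use of MD5 and the choice of
     100 virtual nodes per physical node are assumptions made for this
     illustration, not details taken from any particular datastore.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    # MD5 is used only to spread keys evenly, not for security.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: Dict[int, str] = {}   # ring position -> node name
        self._positions: List[int] = []   # sorted ring positions
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node owns several positions ("virtual nodes") on the ring.
        for i in range(self._vnodes):
            pos = _hash(f"{node}#{i}")
            self._ring[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node: str) -> None:
        for i in range(self._vnodes):
            pos = _hash(f"{node}#{i}")
            del self._ring[pos]
            self._positions.remove(pos)

    def node_for(self, key: str) -> str:
        # First position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._positions, _hash(key))
        return self._ring[self._positions[idx % len(self._positions)]]
```

     When a node is added, the only keys that change owner are those that
     now fall just before one of the new node's positions; every other key
     stays where it was. With a plain `hash(key) % N` scheme, by contrast,
     changing N remaps most keys.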




       Picture from http://www.lexemetech.com/2007/11/consistent-hashing.html
Implementation side : Bloom Filter

     • [Wikipedia] Bloom filter is a space-efficient probabilistic
       data structure that is used to test whether an element is
       a member of a set.
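
     A small Python sketch of the idea. The class name `BloomFilter`, the
     default sizes, and the use of salted SHA-256 digests as the k hash
     functions are assumptions chosen for readability; production
     implementations typically use faster non-cryptographic hashes.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self._bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

     The filter never returns a false negative, but it can return a false
     positive. A datastore can therefore use it to skip disk reads for keys
     that are certainly absent, and pay for a real lookup only on "maybe".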




       Picture from http://en.wikipedia.org/wiki/File:Bloom_filter.svg
Implementation side : Quorum

     • [Wikipedia] A quorum is the minimum number of votes
       that a distributed transaction has to obtain in order to be
       allowed to perform an operation in a distributed system. A
       quorum-based technique is implemented to enforce
       consistent operation in a distributed system.

     • Quorum : R + W > N
          N : number of replicas, R : number of nodes read, W : number of
          nodes written.
        • R = 1, W = N
        • R = N, W = 1
        • R = W = ⌈N/2⌉ (+1 on either R or W if N is even)
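
     The condition R + W > N guarantees that any set of R replicas read
     overlaps any set of W replicas written, so a read always touches at
     least one replica holding the latest write. A brute-force check in
     Python (function names are illustrative):

```python
from itertools import combinations


def satisfies_quorum(n: int, r: int, w: int) -> bool:
    """The quorum condition R + W > N."""
    return r + w > n


def always_intersect(n: int, r: int, w: int) -> bool:
    """Check by enumeration that every read set meets every write set."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


if __name__ == "__main__":
    n = 3
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            print(n, r, w, satisfies_quorum(n, r, w), always_intersect(n, r, w))
```

     Small R favours read latency, small W favours write latency; the
     inequality states how far one can be lowered before the other has to
     rise.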



Implementation side : Vector Clocks

     • [Wikipedia] Vector Clocks is an algorithm for generating
       a partial ordering of events in a distributed system and
       detecting causality violations.
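
     A minimal Python sketch of the comparison rules, assuming a clock is
     represented as a dictionary from node name to counter (missing entries
     count as 0). The function names are illustrative.

```python
from typing import Dict

Clock = Dict[str, int]


def tick(clock: Clock, node: str) -> Clock:
    """Return a copy of `clock` with `node`'s counter incremented."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: Clock, b: Clock) -> Clock:
    """Element-wise maximum, applied when a node receives a message."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def happened_before(a: Clock, b: Clock) -> bool:
    """True if a causally precedes b: a <= b everywhere, a < b somewhere."""
    nodes = a.keys() | b.keys()
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes) and any(
        a.get(n, 0) < b.get(n, 0) for n in nodes
    )


def concurrent(a: Clock, b: Clock) -> bool:
    """True if neither precedes the other: a conflict to be resolved."""
    nodes = a.keys() | b.keys()
    return any(a.get(n, 0) > b.get(n, 0) for n in nodes) and any(
        a.get(n, 0) < b.get(n, 0) for n in nodes
    )
```

     Because the ordering is only partial, two versions written
     independently on different nodes compare as concurrent, and the
     datastore (or the application) has to reconcile them.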




       Picture from http://en.wikipedia.org/wiki/File:Vector_Clock.svg
Implementation side : other Common Concepts

     • Replication
        • Multi-master (Gossip, agents, P2P)
        • Master-slave (+failover)
     • Merkle Tree
        • The main use of hash trees is to make sure that data blocks received from
          other peers in a peer-to-peer network are received undamaged and unaltered
     • Multiversion concurrency control
        • MVCC is a concurrency control method commonly used by database
          management systems to provide concurrent access to the database and in
          programming languages to implement transactional memory.
     • SEDA
        • Staged Event-Driven Architecture
     • LSM-Tree (Log-Structured Merge-Tree)
        • Efficient replacement for the B-Tree that requires fewer disk seeks.
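
     Of the concepts listed above, the Merkle tree lends itself to a short
     sketch. The Python below computes the root hash of a binary hash tree;
     SHA-256 and the convention of duplicating the last hash on odd levels
     are assumptions for this example, since implementations differ on both.

```python
import hashlib
from typing import List


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Root hash of a binary hash tree built over `blocks`."""
    if not blocks:
        raise ValueError("at least one block is required")
    level = [_h(block) for block in blocks]   # leaf hashes
    while len(level) > 1:
        if len(level) % 2 == 1:               # odd count: duplicate the last hash
            level.append(level[-1])
        # Each parent is the hash of its two children concatenated.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

     Two replicas can compare root hashes first and descend into subtrees
     only where the hashes differ, so the amount of data exchanged is
     proportional to the differences rather than to the dataset.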




User side : Map Reduce

     • [Wikipedia] MapReduce is a framework for processing
       huge datasets on certain kinds of distributable problems
       using a large number of computers (nodes). It is inspired
       by the map and reduce functions commonly used in
       functional programming.
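
     The classic word-count example, sketched in plain Python on a single
     machine. The three phases are written as separate functions (names are
     illustrative) to show where a real framework would distribute the work
     across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    # Emit one (word, 1) pair per word; each document is mapped independently.
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all values emitted for the same key.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Each key is reduced independently of all the others.
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

     Map calls share no state and reduce calls operate on disjoint keys,
     which is what allows both phases to run in parallel on many nodes.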




       Picture from http://code.google.com/edu/parallel/mapreduce-tutorial.html
User side : Inverted Indexes

     • [Wikipedia] An inverted index is an index data structure
       storing a mapping from content, such as words or
       numbers, to its locations in a database file, or in a
       document or a set of documents. The purpose of an
       inverted index is to allow fast full text searches, at a cost
       of increased processing when a document is added to
       the database (data denormalization).
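
     A minimal Python sketch, assuming documents are plain strings keyed by
     an id and words are split on whitespace (function names are
     illustrative):

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of ids of the documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Ids of the documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

     The cost is paid at write time: adding a document means updating one
     entry per distinct word, but a search then becomes a few dictionary
     lookups and set intersections.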




       Picture from http://developer.apple.com/library/mac/
User side : other Patterns

     • Idempotent updates
        • Repeating an update does not lead to inconsistent data.
     • Offline (asynchronous) processing
        • Push on change (vs. pull on demand), batches
     • SOA and service isolation
        • Resilience to service failure. The product is a mash-up of
          back-end services.
     • Schemaless (ColumnFamily-based data model)
        • Schema updates are no longer required
     • Data locality
        • Processing is sent to the data instead of data being sent to workers.
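
     As an illustration of the first pattern in the list, one common way to
     make an update idempotent is to tag it with a unique operation id and
     ignore replays. The Python below is a sketch of that idea only; the
     class `Counter` and the `op_id` parameter are hypothetical names.

```python
from typing import Set


class Counter:
    """A counter whose updates can be retried safely."""

    def __init__(self) -> None:
        self.value = 0
        self._applied: Set[str] = set()   # ids of operations already applied

    def add(self, op_id: str, delta: int) -> None:
        # Replaying an operation that was already applied is a no-op,
        # so a client may retry after a timeout without double counting.
        if op_id in self._applied:
            return
        self._applied.add(op_id)
        self.value += delta
```

     In a distributed system a client often cannot tell whether a timed-out
     write was applied; with idempotent updates it can simply send it again.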




Existing Implementations




Non exhaustive list of Existing Implementations

     •   Google BigTable, Megastore   •   CouchDB,
     •   Amazon Dynamo,               •   MongoDB,
     •   Apache HBase,                •   Tokyo Cabinet,
     •   Apache Cassandra,            •   SimpleDB,
     •   Riak,                        •   Redis,
     •   Project Voldemort,           •   Memcached,
     •   FlockDB,                     •   Infinispan,
     •   Neo4j,                       •   Scalaris,
     •   Hypertable,                  •   Terrastore,
     •   Jackrabbit,                  •   MySQL HandlerSocket,
     •   Solr, ElasticSearch, ...     •   Almost a new project every
                                          week ...



NoSQL in the Real World




Real World Example

     • Facebook's new messaging system
        • Based on HBase (consistency model simpler than Cassandra's)
     • Amazon shopping cart
        • Based on Dynamo; rows are never deleted. Only deltas (+1, +2,
          -1) are added to the cart.
     • Twitter
        • Uses Cassandra for real-time analytics
     • Google Megastore
        • ACID within partitions, lower consistency across partitions
        • Synchronous replication with Paxos algorithm




Real World Example (2)

     • Digg, Last.fm, NYT, Guardian, Yahoo, Flickr, Netflix,
       Adobe, Mozilla, GitHub, LinkedIn, StumbleUpon, ...

     • MapReduce usage is increasing daily
        • Google and Amazon established their dominance by being the
          first to master big data.


     • Can your business jump into the NoSQL wave ?




Conclusion




Conclusion

     • NoSQL is not a general-purpose datastore !
        • The eventually consistent model can be tricky, and can hold
          nasty surprises if not used carefully.
        • MapReduce search jobs have high latency
     • SQL and NoSQL are complementary
        • Knowing both lets you put the right technology in the right place.
     • NoSQL has challenged RDBMS supremacy
        • New ideas for RDBMS are emerging
        • See HandlerSocket, a MySQL NoSQL plugin.




Thanks for listening
     Questions ?
       Benoit Perroud
       bperroud@verisign.com
       @killerwhile on Twitter




Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

NoSQL overview implementation free

  • 1. NoSQL overview Implementation free Benoit Perroud 21. January 2011, 01. March 2011
  • 2. Disclaimer Any views or opinions presented in this presentation are solely those of the author and do not necessarily represent those of Verisign. 2
  • 3. Outline • Introduction • Scalability and Availability • Difficulties to scale • CAP Theorem • NoSQL Goals • NoSQL Taxonomy • Concepts and Patterns • Existing implementations • NoSQL in the Real World • Conclusion 3
  • 5. NoSQL Term • Mandatory name disambiguation : NoSQL stands for Not Only SQL. • The term NoSQL is more or less attributed to Eric Evans, a Rackspace employee, who used it in early 2009 when Johan Oskarsson, a Last.fm employee, wanted to organize an event to discuss open-source distributed databases. 5
  • 6. Wikipedia Definition • [Wikipedia] NoSQL is a term used to designate database management systems that differ from classic relational database management systems (RDBMS) in some way. These data stores may not require fixed table schemas, usually avoid join operations, do not attempt to provide ACID (atomicity, consistency, isolation, durability) properties and typically scale horizontally. 6
  • 8. Why NoSQL ? • Is there really a problem with SQL and RDBMS ? No there isn't ! SQL is powerful, ACID (atomicity, consistency, isolation, durability) properties are well-established, developers and DBAs have mastered it. • So why do we need NoSQL ? • What were the motivations of Google and Amazon to invest huge amounts in research around NoSQL ? • Why are “social sites” so hard to scale ? 8
  • 10. Scalability definition • [Wikipedia] Scalability is a desirable property of a system, a network, or a process, which indicates its ability to either handle growing amounts of work in a graceful manner or to be readily enlarged. In summary : handle load and peaks. • Scalability in two dimensions : • Scale up → scale vertically (increase RAM in an existing node) • Scale out → scale horizontally (add a node to the cluster) 10
  • 11. Availability definition • [Wikipedia] Availability refers to the ability of the users to access and use the system. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable. In summary : minimize downtime.
      Availability %           | Downtime per year | Downtime per month | Downtime per week
      90% ("one nine")         | 36.5 days         | 72 hours           | 16.8 hours
      95%                      | 18.25 days        | 36 hours           | 8.4 hours
      99% ("two nines")        | 3.65 days         | 7.20 hours         | 1.68 hours
      99.9% ("three nines")    | 8.76 hours        | 43.2 minutes       | 10.1 minutes
      99.99% ("four nines")    | 52.56 minutes     | 4.32 minutes       | 1.01 minutes
      99.999% ("five nines")   | 5.26 minutes      | 25.9 seconds       | 6.05 seconds
      99.9999% ("six nines")   | 31.5 seconds      | 2.59 seconds       | 0.605 seconds
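The downtime figures above follow from simple arithmetic: allowed downtime = (1 - availability) × period length. A minimal Python sketch of that calculation is given here; the function name `downtime_seconds` and the period constants are illustrative, and the 365-day year and 30-day month are assumptions that happen to match the figures on the slide.

```python
def downtime_seconds(availability_percent: float, period_hours: float) -> float:
    """Allowed downtime, in seconds, for a given availability over a period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_hours * 3600.0


# Period lengths in hours (assumptions: non-leap year, 30-day month).
HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = 30 * 24
HOURS_PER_WEEK = 7 * 24

for pct in (99.0, 99.9, 99.999):
    per_year_hours = downtime_seconds(pct, HOURS_PER_YEAR) / 3600.0
    print(f"{pct}% -> {per_year_hours:.2f} hours of downtime per year")
```

Each extra "nine" divides the allowed downtime by ten, which is why moving from three to five nines is so demanding operationally.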
  • 13. RDBMS Scalability • RDBMS are hard to scale. “Hard” should be understood as costly : RDBMS licenses, hardware, DBAs' and operational costs grow non-linearly with the load. RDBMS have either : • a single point of failure (SPOF) → the master DB • or replication latency → distributed transactions : two-phase commit, Paxos algorithm. 13
  • 14. Hardware Scalability • Commodity hardware and appliances are I/O bound • But network throughput is cheaper than hard disk throughput. • Hard disks are (were?) the main bottleneck in today's computing. • Random access has a high latency (disk seek) • Throughput has increased, but not proportionally with the storage → Distributing data across a network of small computers (and applying the data locality concept) scales better (cheaper) than a huge appliance. 14
  • 15. Pioneers in Scaling The Big • Google : huge amount of data (hundreds of petabytes, does not fit on a single appliance) • needs to be partitioned; the failure rate is proportional to the number of machines (so the cluster's MTBF shrinks as machines are added). → BigTable. • Amazon : high availability (99.999%) • needs redundancy and write scalability → Dynamo • Facebook, Twitter, “social sites” • no cluster of data to be partitioned, • long tail (old data is still accessed from time to time), • lots of users connected at the same time, • user-specific content (predictions hard to achieve, few static pages) 15
  • 17. Reminder of the CAP theorem • Consistency : all nodes see the same data at the same time • Availability : node failures do not prevent survivors from continuing to operate • Partition Tolerance : the system continues to operate despite arbitrary message loss According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three. 17
  • 18. NoSQL trade-offs • NoSQL datastores have typically made different trade-offs than RDBMS with respect to the CAP theorem • Most of them gave up the “C” of the theorem, giving up the ACID properties in the same way. • NoSQL datastores also have a simpler data access pattern • value = get(key) • put(key, value) • remove(key) 18
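The three-operation access pattern on this slide can be sketched as a tiny in-memory class. This is only an illustration of the interface shape, not any particular product's API; the class name `InMemoryKeyValueStore` is hypothetical, and a real datastore would add partitioning, replication and persistence behind the same three calls.

```python
from typing import Dict, Optional


class InMemoryKeyValueStore:
    """Illustrative single-process stand-in for a key/value datastore."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        # Overwrites any previous value stored under the same key.
        self._data[key] = value

    def remove(self, key: str) -> None:
        # Removing a missing key is a no-op.
        self._data.pop(key, None)


store = InMemoryKeyValueStore()
store.put("user:42", b"alice")
print(store.get("user:42"))
store.remove("user:42")
print(store.get("user:42"))
```

Note what is absent compared with SQL: no joins, no secondary predicates, no multi-key transactions. That restriction is what makes the data easy to spread over many nodes.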
  • 20. NoSQL promises • The aim of NoSQL is to achieve (horizontal) scalability and high availability. • Business goal : Keep cost growing proportionally with the load (tight provisioning). • Operational goal : • Scale the system by simply adding (or removing) nodes. • The system runs on commodity hardware. 20
  • 22. Taxonomy The most common NoSQL types are : • Document store • stores documents whose structure can be explored. • Key/value store • simple hash map data access pattern. • Column oriented store • Something between a simple key/value store and a complex document store • Graph database • stores nodes and edges, walk-through data access • Object database 22
  • 23. Classification 23 Picture from http://blog.nahurst.com/visual-guide-to-nosql-systems
  • 25. Implementation side : Consistent Hashing • [Wikipedia] Consistent hashing is a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots. 25 Picture from http://www.lexemetech.com/2007/11/consistent-hashing.html
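The property quoted above (adding or removing a slot barely changes the key mapping) can be illustrated with a small hash ring. This is a minimal sketch under stated assumptions: the class name `HashRing`, the use of MD5 for ring positions and the 100 virtual nodes per physical node are all illustrative choices, not taken from the slides.

```python
import bisect
import hashlib


def _position(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node) tuples
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node is placed at several positions to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring is empty")
        # First ring entry at or after the key's position, wrapping around.
        index = bisect.bisect_left(self._ring, (_position(key), ""))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved} of {len(keys)} keys moved after adding a node")
```

With a naive `hash(key) % number_of_nodes` scheme, going from three to four nodes would remap most keys; on the ring, only the keys taken over by the new node move.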
  • 26. Implementation side : Bloom Filter • [Wikipedia] Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. 26 Picture from http://en.wikipedia.org/wiki/File:Bloom_filter.svg
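A Bloom filter answers "definitely not in the set" or "possibly in the set". The sketch below is a minimal illustration, assuming a fixed bit array and hash positions derived from seeded SHA-256 digests; the class name `BloomFilter`, the default sizes and the hashing scheme are illustrative choices, not tuned values.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self._size = size_bits
        self._num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self._bits = bytearray(size_bits)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self._num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self._size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self._bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("row-key-1")
print(seen.might_contain("row-key-1"))
```

Datastores typically consult such a filter before touching disk: a negative answer avoids a seek for a key that is certainly not in a given file, at the price of occasional false positives.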
  • 27. Implementation side : Quorum • [Wikipedia] A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system. • Quorum : R + W > N, where N : number of replicas, R : number of nodes read, W : number of nodes written. • R = 1, W = N • R = N, W = 1 • R = N/2, W = N/2, rounded up (+1 if N is even) 27
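The R + W > N condition guarantees that every read set overlaps every write set in at least one replica. A minimal sketch of the check follows; the function names `is_strict_quorum` and `majority` are hypothetical helpers introduced for illustration.

```python
def is_strict_quorum(n: int, r: int, w: int) -> bool:
    """True when any R replicas read must overlap any W replicas written."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of N."""
    return n // 2 + 1


for n, r, w in [(3, 1, 3), (3, 3, 1), (3, 2, 2), (3, 1, 1)]:
    print(f"N={n} R={r} W={w} -> overlap guaranteed: {is_strict_quorum(n, r, w)}")
```

The three configurations on the slide trade read cost against write cost: R = 1, W = N makes reads cheap and writes expensive, R = N, W = 1 does the opposite, and majorities on both sides split the cost.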
  • 28. Implementation side : Vector Clocks • [Wikipedia] Vector Clocks is an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations. 28 Picture from http://en.wikipedia.org/wiki/File:Vector_Clock.svg
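The partial ordering mentioned above can be sketched with plain dictionaries mapping a node name to a counter. The helper names (`increment`, `merge`, `happened_before`, `concurrent`) are illustrative and not taken from any specific datastore.

```python
from typing import Dict

VectorClock = Dict[str, int]


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with the given node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, used when a node learns about another version."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True when a is causally before b: a <= b everywhere and a < b somewhere."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True when each clock is ahead of the other on some node (a conflict)."""
    nodes = a.keys() | b.keys()
    return (any(a.get(n, 0) > b.get(n, 0) for n in nodes)
            and any(b.get(n, 0) > a.get(n, 0) for n in nodes))


write_a = increment({}, "A")   # written on node A
write_b = increment({}, "B")   # written independently on node B
print(concurrent(write_a, write_b))
resolved = increment(merge(write_a, write_b), "A")
print(happened_before(write_b, resolved))
```

Two versions that are concurrent cannot be ordered automatically; the application (or a rule such as last-write-wins) has to reconcile them, after which the merged clock dominates both.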
  • 29. Implementation side : other Common Concepts • Replication • Multi-master (Gossip, agents, P2P) • Master-slave (+failover) • Merkle Tree • The main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network are received undamaged and unaltered • Multiversion concurrency control • MVCC is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memory. • SEDA • Staged Event-Driven Architecture • LSM tree (Log-Structured Merge tree) • Efficient alternative to B-trees that requires fewer disk seeks. 29
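Of the concepts listed on this slide, the Merkle tree is the easiest to show in a few lines. The sketch below computes a root hash over a list of data blocks, assuming SHA-256 and the convention of duplicating the last hash on odd-sized levels; both are illustrative choices, and the function name `merkle_root` is hypothetical.

```python
import hashlib
from typing import List


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Root hash of a binary hash tree built over the given blocks."""
    if not blocks:
        return _sha256(b"")
    level = [_sha256(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Duplicate the last hash so every node has two children
            # (one common convention, assumed here).
            level.append(level[-1])
        level = [_sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


replica_1 = [b"block-0", b"block-1", b"block-2"]
replica_2 = [b"block-0", b"block-1", b"block-X"]
print(merkle_root(replica_1) == merkle_root(replica_2))
```

Two replicas that agree on the root hold identical data. When the roots differ, comparing child hashes level by level narrows the difference down to the affected blocks without shipping the whole data set.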
  • 30. User side : Map Reduce • [Wikipedia] MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes). It is inspired by the map and reduce functions commonly used in functional programming. 30 Picture from http://code.google.com/edu/parallel/mapreduce-tutorial.html
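The classic illustration of the map and reduce functions is a word count. The following single-process sketch only shows the shape of the computation; in a real framework the map and reduce calls run on many nodes and the shuffle step moves data across the network. The function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Collapse all values emitted for one key into a single result."""
    return word, sum(counts)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["to be or not to be", "to scale out"]))
```

Because each map call looks at one document and each reduce call at one key, both phases can be spread over nodes without coordination.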
  • 31. User side : Inverted Indexes • [Wikipedia] An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database (data denormalization). 31 Picture from http://developer.apple.com/library/mac/
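An inverted index can be sketched as a dictionary from each word to the set of documents containing it. The example below is a minimal illustration with whitespace tokenization; the function names and the sample documents are made up for the purpose.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every lower-cased word to the ids of the documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Ids of the documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "d1": "NoSQL scales horizontally",
    "d2": "SQL and NoSQL are complementary",
}
index = build_inverted_index(docs)
print(search(index, "nosql sql"))
```

The cost noted on the slide is visible here: every added document must update one index entry per word, in exchange for lookups that never scan document text.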
  • 32. User side : other Patterns • Idempotent updates • Repeating the update twice does not lead to inconsistent data. • Offline (asynchronous) processing • Push on change (vs. pull on demand), batches • SOA and services isolation • Service failure resilience. Product is a mash-up of back-end services. • Schemaless (ColumnFamily-based data model) • Schema update is no longer required • Data locality • Processing is sent to data instead of data sent to workers. 32
  • 34. Non exhaustive list of Existing Implementations • Google BigTable, Megastore • CouchDB, • Amazon Dynamo, • MongoDB, • Apache HBase, • Tokyo Cabinet, • Apache Cassandra, • SimpleDB, • Riak, • Redis, • Project Voldemort, • Memcached, • FlockDB, • Infinispan, • Neo4j, • Scalaris, • Hypertable, • Terrastore, • Jackrabbit, • MySQL HandlerSocket, • Solr, Elasticsearch, ... • Almost a new project every week ... 34
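The first pattern on this slide, idempotent updates, matters because a client that times out will usually retry, and the datastore may then see the same update twice. One possible way to make an increment safe to replay is to tag it with a request id, as in the sketch below; the class name `Counter` and the `request_id` parameter are hypothetical, and other techniques (for example writing absolute values instead of deltas) work as well.

```python
from typing import Set


class Counter:
    def __init__(self) -> None:
        self.value = 0
        self._applied: Set[str] = set()  # ids of updates already applied

    def increment(self, amount: int) -> None:
        """Not idempotent: a retried call is counted twice."""
        self.value += amount

    def increment_once(self, request_id: str, amount: int) -> None:
        """Idempotent: replaying the same request id has no further effect."""
        if request_id in self._applied:
            return
        self._applied.add(request_id)
        self.value += amount


counter = Counter()
counter.increment_once("req-1", 5)
counter.increment_once("req-1", 5)  # retry after a timeout
print(counter.value)
```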
  • 35. NoSQL in the Real World 35
  • 36. Real World Example • Facebook new messaging system • Based on HBase (consistency model simpler than Cassandra) • Amazon shopping cart • Based on Dynamo; rows are never deleted, only deltas (+1, +2, -1) are added to the cart. • Twitter • Uses Cassandra for real time analytics • Google Megastore • ACID within partitions, lower consistency across partitions • Synchronous replication with Paxos algorithm 36
  • 37. Real World Example (2) • Digg, Last.fm, NYT, Guardian, Yahoo, Flickr, NetFlix, Adobe, Mozilla, Github, Linkedin, StumbleUpon, ... • MapReduce usage is increasing daily • Google and Amazon established their dominance by being the first to master big data. • Can your business jump into the NoSQL wave ? 37
  • 39. Conclusion • NoSQL is not a general purpose datastore ! • The eventually consistent model can be tricky, and can hold nasty surprises if not used carefully. • MapReduce search jobs have high latency • SQL and NoSQL are complementary • Knowing both allows putting the right technology in the right place. • NoSQL has challenged RDBMS supremacy • New ideas for RDBMS are emerging • See HandlerSocket, a MySQL NoSQL plugin. 39
  • 40. Thanks for listening Questions ? Benoit Perroud bperroud@verisign.com @killerwhile on Twitter 40