Introduction to NoSQL
        2011.02
      Quang Nguyen
Agenda                          Communicating Knowledge



    New Challenges for RDBMS
    Introduction to NoSQL
    MongoDB Sharding




Relational DBMS

      Since 1970
      Use SQL to manipulate data
         Easy to use
         Easy to integrate with other systems
      Excellent for applications such as management
      (accounting, reservations, staff management,
      etc.)




ACID Properties of RDBMS

  Relational databases guarantee these four properties for transactions
    Atomic: “all or nothing”; a transaction either takes
    effect completely or not at all
    Consistent: data moves from one correct state to
    another correct state
    Isolated: concurrent transactions do not
    interfere with each other
    Durable: once a transaction has succeeded, its
    changes will not be lost
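
The atomicity property can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not part of the original slides: the `account` table, the `transfer` helper and the balances are assumptions chosen to match the bank example used later in the deck.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
        return True
    except ValueError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))

# Hypothetical data: account A holds 3000, account B holds 2000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 3000), ("B", 2000)])
conn.commit()

ok = transfer(conn, "A", "B", 1000)      # succeeds and is committed
failed = transfer(conn, "A", "B", 5000)  # overdraws A, so it is rolled back
final = balances(conn)
```

The second transfer leaves no trace: the debit that was already executed is undone by the rollback.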




What is the problem with RDBMS?

      Schemas aren't designed for sparse data
         Normalization creates a lot of tables
         Joins can be prohibitively expensive
      Most importantly, relational databases were not
      designed to be distributed.




An Example of a Distributed DB

      A banking system consists of four branches in
      four different cities. Each branch maintains its
      accounts locally
         Account = (account-number, branch, balance)
      One single site maintains information about
      the branches
         Branch = (branch-name, city, assets)




An Example of a Distributed DB

      [Diagram: a client asks the transaction coordinator to transfer $1000
      from an account at Bank A (A: $3000) to an account at Bank B (B: $2000)]

      Clients want all-or-nothing transactions
         Transfer either happens or not at all

An Example of a Distributed DB

      Simple solution
         [Sequence diagram: client -> transaction coordinator: start;
          coordinator -> bank A: A=A-1000; coordinator -> bank B: B=B+1000;
          coordinator -> client: done]

      What can go wrong?
         A does not have enough money
         B’s account no longer exists
         B has crashed
         Coordinator crashes
An Example of a Distributed DB

      Two-phase Commit Protocol (2PC)
         [Sequence diagram: client -> transaction coordinator: start;
          coordinator -> bank A and bank B: prepare (accounts are locked);
          bank A -> coordinator: rA; bank B -> coordinator: rB;
          coordinator -> bank A and bank B: outcome;
          coordinator -> client: result]

         Coordinator:
             If rA==yes && rB==yes
                 outcome = “commit”
             else
                 outcome = “abort”
         B commits upon receiving “commit”

      Loss of availability and higher latency!
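
The protocol can be sketched in a few lines of Python. This is an illustrative, single-process simulation and not a real distributed implementation: the `Participant` class, its method names and the balances are assumptions, and failures of the coordinator or the network are not modelled.

```python
class Participant:
    """A bank holding one account; votes in phase 1, applies the change in phase 2."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.pending = None    # change that was prepared but not yet committed
        self.locked = False

    def prepare(self, delta):
        # Vote "no" if the change would overdraw the account.
        if self.balance + delta < 0:
            return False
        self.pending, self.locked = delta, True   # account stays locked until the outcome
        return True

    def finish(self, outcome):
        if self.pending is not None and outcome == "commit":
            self.balance += self.pending
        self.pending, self.locked = None, False


def two_phase_commit(changes):
    """changes: list of (participant, delta). Returns "commit" or "abort"."""
    votes = [p.prepare(delta) for p, delta in changes]    # phase 1: prepare
    outcome = "commit" if all(votes) else "abort"
    for p, _ in changes:                                   # phase 2: outcome
        p.finish(outcome)
    return outcome


bank_a = Participant("A", 3000)
bank_b = Participant("B", 2000)
first = two_phase_commit([(bank_a, -1000), (bank_b, +1000)])
second = two_phase_commit([(bank_a, -5000), (bank_b, +5000)])
```

In the second call bank A votes "no", so bank B releases its lock without applying the credit. While a participant waits for the outcome its account is locked, which is where the loss of availability and the extra latency come from.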
Schemas vs. Schema-free

      Use tables to represent real objects
      Join operations are expensive and difficult to
      execute when scaling out horizontally

      Name    Surname  Home               Mobile  Telephone  Office  Marital Status  ...
      Quang   Nguyen   null               398     null       null    null            null
      Cuong   Trinh    Nguyen Dinh Chieu  999     555        null    null            null
      ...     ...      ...                ...     ...        ...     ...             ...

      user: {
        name: quang,
        surname: Nguyen,
        mobile: 398
      }

      user: {
        name: Cuong,
        surname: Trinh,
        Home: Nguyen Dinh Chieu,
        mobile: 999,
        Telephone: 555
      }
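
The difference between the sparse table and the schema-free documents can be shown with plain Python data structures. This is only a sketch: the column names are lower-cased versions of the table headers, and `row_to_document` is a hypothetical helper.

```python
COLUMNS = ["name", "surname", "home", "mobile", "telephone", "office", "marital_status"]

# Fixed schema: every row carries every column, mostly filled with NULLs.
rows = [
    ("Quang", "Nguyen", None, "398", None, None, None),
    ("Cuong", "Trinh", "Nguyen Dinh Chieu", "999", "555", None, None),
]

def row_to_document(row):
    """Keep only the fields that actually hold a value."""
    return {col: val for col, val in zip(COLUMNS, row) if val is not None}

documents = [row_to_document(r) for r in rows]
null_cells = sum(val is None for row in rows for val in row)
```

Here 6 of the 14 table cells are NULL, while the two documents store 3 and 5 fields respectively.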
New Trends and Requirements




The amount of information is growing fast

      In 2010, the amount of information created and
      replicated exceeded one zettabyte (a trillion
      gigabytes) for the first time. In 2011, it surpassed
      1.8 zettabytes




Google: BigTable

      Web Indexing
      Google Earth
      YouTube
      Google Books
      Google Mail

              High Scalability
              High Availability


Amazon: DynamoDB

     RDBMS doesn’t fit the requirements
     Tens of thousands of servers around the world
     Tens of millions of customers

               High Reliability
               High Availability


Facebook: Cassandra, HBase

      People
         More than 800 million active users
         More than 50% of active users log on to
          Facebook on any given day
         Average user has 130 friends
      Activity
         More than 900 million objects that people interact with
          (pages, groups, events and community pages)
         On average, more than 250 million photos are
          uploaded per day
      Messaging system, including chat, wall posts,
      and email, handles 135+ billion messages per month

              High Scalability
              High Availability

Twitter

          High Availability



CAP Theorem

 It is impossible for a distributed computer system
 to simultaneously provide all three of the following
 guarantees
   Consistency: all nodes see the same data at the
   same time
   Availability: every request receives a response
   about whether it succeeded or failed
   Partition Tolerance: the system continues to
   operate despite arbitrary message loss

 You can have at most two. A distributed system cannot
 rule out network partitions, so in practice the choice is
 between consistency and availability, and large-scale web
 systems often choose availability over consistency
Consistency Level

      Strong (Sequential): After an update
      completes, any subsequent access will return the
      updated value.
      Weak (weaker than Sequential): The system
      does not guarantee that subsequent accesses
      will return the updated value.
      Eventual: All updates will propagate to
      all of the replicas in a distributed system, but
      this may take some time. Eventually, all replicas
      will be consistent.
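
Eventual consistency can be illustrated with a toy simulation. The sketch below is an assumption-laden model, not how any particular database works: `Replica`, `anti_entropy` and the last-write-wins rule based on timestamps are chosen for illustration only.

```python
class Replica:
    def __init__(self):
        self.store = {}    # key -> (timestamp, value)

    def write(self, key, value, ts):
        # Last-write-wins: keep the version with the newest timestamp.
        if key not in self.store or ts > self.store[key][0]:
            self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Exchange versions so every replica ends up with the newest one per key."""
    for src in replicas:
        for dst in replicas:
            for key, (ts, value) in list(src.store.items()):
                dst.write(key, value, ts)


replicas = [Replica(), Replica(), Replica()]
replicas[0].write("balance", 3000, ts=1)
anti_entropy(replicas)

replicas[0].write("balance", 2000, ts=2)   # accepted by one replica only
stale = replicas[1].read("balance")        # another replica still returns the old value
anti_entropy(replicas)                     # the update propagates
fresh = replicas[1].read("balance")
```

Between the write and the propagation step a reader can see the old value; after propagation all replicas agree.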



What is NoSQL

      Stands for Not Only SQL
      Class of non-relational data storage systems
      These systems usually do not require a fixed table
      schema, nor do they use the concept of joins
      Most NoSQL offerings relax one or more of the
      ACID properties

        NoSQL != [image]




NoSQL Features

      Key/Value stores or “the big hash table”
         Amazon S3 (Dynamo)
         Memcached
      Schema-less, which comes in multiple flavors
         Document-based (MongoDB, CouchDB)
         Column-based (Cassandra, HBase)
         Graph-based (Neo4j)




Key/Value

      Advantages
         Very fast
         Very scalable
         Simple model
         Able to distribute horizontally

      Disadvantages
         Many data structures (objects) can't be easily
          modeled as key/value pairs




Schema-less

      Advantages
         Schema-less data model is richer than key/value
          pairs
         Eventual consistency
         Many are distributed
         Still provide excellent performance and scalability

      Disadvantages
         Typically no ACID transactions




Memcached




Introduction to MongoDB

      MongoDB is a document-oriented database
         Key -> Document
         Structured Document
         Schema-free

      Key = quang  ->  user: {
                         name: quang,
                         surname: Nguyen,
                         mobile: 398
                       }

      Key = cuong  ->  user: {
                         name: Cuong,
                         surname: Trinh,
                         Home: Nguyen Dinh Chieu,
                         mobile: 999,
                         Telephone: 555
                       }
Introduction to MongoDB

      Query = quang
         Result count: 1
         user: {
           name: quang,
           surname: Nguyen,
           mobile: 398
         }

      Query = cuong
         Result count: 1
         user: {
           name: Cuong,
           surname: Trinh,
           Home: Nguyen Dinh Chieu,
           mobile: 999,
           Telephone: 555
         }
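
The key-to-document idea can be mimicked with an in-memory dictionary. This sketch deliberately does not use MongoDB or its driver; `find` is a hypothetical helper that only imitates query-by-example over documents with different fields.

```python
# Each key maps to a whole document; documents need not share the same fields.
users = {
    "quang": {"name": "quang", "surname": "Nguyen", "mobile": "398"},
    "cuong": {"name": "Cuong", "surname": "Trinh", "home": "Nguyen Dinh Chieu",
              "mobile": "999", "telephone": "555"},
}

def find(collection, **criteria):
    """Return every document whose fields match all of the given criteria."""
    return [doc for doc in collection.values()
            if all(doc.get(field) == value for field, value in criteria.items())]

by_key = users["quang"]                  # lookup by key returns one document
by_field = find(users, surname="Trinh")  # lookup by field value
no_match = find(users, surname="Nobody")
```

A query on a field that some documents lack simply does not match those documents; no schema change is needed.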
Features of MongoDB

      Indexing
      Stored JavaScript
      Aggregation
      File Storage
      Makes scaling out easier
         Scaling out vs. scaling up
         Scaling out is automatic: data is balanced across
          the cluster




Some applications of MongoDB

      Large-scale applications
      Archiving and event logging
      Document and Content Management Systems

      Foursquare uses MongoDB to store venues and user
      "check-ins" into venues, sharding the data over
      more than 25 machines on Amazon EC2
      Craigslist uses MongoDB to archive billions
      of records
      Disney built a common set of tools and APIs for all
      games within the Interactive Media Group, using
      MongoDB as a common object repository to persist
      state information




Introduction to Cassandra

      Column Family: a logical division that groups
      similar data, e.g., User Column Family, Hotel
      Column Family.
      Row-oriented: unlike in a relational model, a row
      does not need to have the same columns as the
      other rows in its column family.
      Schema-free
     Schema-Free




Introduction to Cassandra

      Query = quang
         Result count: 3
         -(column=name, value=quang, timestamp=32345632)
         -(column=surname, value=Nguyen, timestamp=12345678)
         -(column=mobile, value=398, timestamp=33592839)

      Query = cuong
         Result count: 5
         -(column=name, value=Cuong, timestamp=33434343)
         -(column=surname, value=Trinh, timestamp=34568258)
         -(column=Home, value=Nguyen Dinh Chieu, timestamp=54542368)
         -(column=mobile, value=999, timestamp=23445486)
         -(column=Telephone, value=555, timestamp=34314642)
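
The row/column/timestamp layout shown above can be modelled as a nested dictionary. The sketch is a simplified in-memory model, not Cassandra's API: `ColumnFamily`, `insert` and `get` are hypothetical names, and the extra write with an older timestamp is made up to show that the newest timestamp wins.

```python
from collections import defaultdict

class ColumnFamily:
    """Row key -> {column name -> (value, timestamp)}; rows may have different columns."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, row_key, column, value, timestamp):
        current = self.rows[row_key].get(column)
        # The write with the highest timestamp wins.
        if current is None or timestamp > current[1]:
            self.rows[row_key][column] = (value, timestamp)

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))


users = ColumnFamily()
users.insert("quang", "name", "quang", 32345632)
users.insert("quang", "surname", "Nguyen", 12345678)
users.insert("quang", "mobile", "398", 33592839)
users.insert("quang", "mobile", "397", 33000000)   # older timestamp, ignored

result = users.get("quang")
missing = users.get("nobody")
```

The result for `quang` holds three columns, matching "Result count: 3" on the slide.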



Features of Cassandra

      Distributed and Decentralized
         Unlike master/slave systems, where some nodes are set
          up as masters in order to organize other nodes,
          every Cassandra node is identical
         There is no single point of failure
      High Availability & Fault Tolerance
         You can replace failed nodes in the cluster with no
          downtime, and you can replicate data to multiple data
          centers to offer improved local performance and
          prevent downtime if one data center experiences a
          catastrophe such as fire or flood.
      Tunable Consistency
         It allows you to decide the level of consistency
          you require, in balance with the level of availability
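
The trade-off behind tunable consistency can be expressed with the usual quorum arithmetic. This is a sketch under assumptions not stated on the slide: N is the number of replicas, W and R are the number of replicas that must acknowledge a write or a read, and the function names are hypothetical.

```python
def is_strongly_consistent(n, w, r):
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    return r + w > n

def tolerated_failures(n, w, r):
    """Replicas that may be down while both reads and writes can still succeed."""
    return n - max(w, r)

quorum = (is_strongly_consistent(3, 2, 2), tolerated_failures(3, 2, 2))
fast = (is_strongly_consistent(3, 1, 1), tolerated_failures(3, 1, 1))
write_all = (is_strongly_consistent(3, 3, 1), tolerated_failures(3, 3, 1))
```

With three replicas, requiring one acknowledgement keeps the system available with two replicas down but gives up the overlap guarantee; requiring all three for writes guarantees overlap but tolerates no failure.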
Features of Cassandra



     Elastic Scalability
         Elastic scalability refers to a special property of
          horizontal scalability. It means that your cluster can
          seamlessly scale up and scale back down.




Some Applications of Cassandra

      Large Deployments
      Lots of Writes, Statistics, and Analysis
      Geographical Distribution

      Facebook used Cassandra to power Inbox
      Search, with over 200 nodes deployed
      Twitter announced it is planning to use
      Cassandra because it can be run on large
      server clusters and is capable of taking in
      very large amounts of data at a time
      AppScale uses Cassandra as a back-end
      for Google App Engine applications







Neo4j – Graph Database

      Data is stored as a Graph/Network
         Nodes and relationships with properties
      Schema-free

      [Graph diagram: people and Company nodes linked by KNOWS, WORKS
      and OWNS relationships]
         people: { name: quang, surname: Nguyen }
         people: { name: Cuong, surname: Trinh, hobbies: uncountable }
         people: { name: Thanh, surname: Nguyen }
         Company: { name: Saltlux, Vietnam, Area: SearchEngine }
         Company: { name: Fami, area: Furniture }
         Company: { name: TechMaster, area: IT Education, founded: 2011 }
Neo4j – Graph Database

      Find all persons who KNOW a friend who
      KNOWS someone called “Larry Ellison”

        SELECT ?person WHERE {
          ?person neo4j:KNOWS ?friend .
          ?friend neo4j:KNOWS ?foe .
          ?foe neo4j:name "Larry Ellison" .
        }
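
The same two-hop traversal can be written over a plain adjacency map. This is a sketch in Python rather than a Neo4j query: the KNOWS edges and the person identifiers below are invented for illustration, and `knows_friend_who_knows` is a hypothetical helper.

```python
# Hypothetical KNOWS edges: person -> set of persons they know.
knows = {
    "quang": {"cuong", "thanh"},
    "cuong": {"larry"},
    "thanh": {"quang"},
    "larry": set(),
}
names = {
    "quang": "Quang Nguyen",
    "cuong": "Cuong Trinh",
    "thanh": "Thanh Nguyen",
    "larry": "Larry Ellison",
}

def knows_friend_who_knows(target_name):
    """Persons p such that p -KNOWS-> friend -KNOWS-> someone named target_name."""
    return sorted({
        person
        for person, friends in knows.items()
        for friend in friends
        if any(names[foe] == target_name for foe in knows[friend])
    })

matches = knows_friend_who_knows("Larry Ellison")
```

Only `quang` qualifies here: he knows `cuong`, who knows Larry Ellison. Following relationships hop by hop like this is the access pattern a graph database is built for.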




Features of Neo4j

      Disk-based
      Fully transactional like a real database (ACID is
      satisfied)
      Scales up: Neo4j can handle graphs of several
      billion nodes/relationships/properties on a
      single machine.
      No sharding




Some Applications of Neo4j



     Ideal for any application that relies on the
      relationships between records
         Social Networks
         Recommendations




MongoDB Sharding



Some Considerations

      What if you want to store a large volume of data, or
      access it at a higher rate than a single
      server can handle?
      As more servers are added, what are the
      dependencies between them?
      Can your application cope if one server, or a subset
      of servers, crashes?
      What if communication between servers fails?




What is sharding

      Sharding is the method MongoDB uses to split a
      large collection across several servers (called a
      cluster)
      MongoDB does almost everything automatically;
      MongoDB lets your application grow easily,
      robustly, and naturally
         Making the cluster “invisible”
         Making the cluster always available for reads and
          writes
         Letting the cluster grow easily




A Shard

      A shard is one or more servers in a cluster that
      are responsible for some subset of the data
      A shard can consist of many servers. If there is
      more than one server in a shard, each server
      has an identical copy of the subset of the data

      [Diagram: a shard made up of several servers, each holding the same data “abc”]




Distributing Data – One range per shard

      One range per shard

           [“a”, “f”)    [“f”, “n”)    [“n”, “t”)    [“t”, “{”)
            Shard 1       Shard 2       Shard 3       Shard 4

      Data movement issue: moving [“c”, “f”) from Shard 1 to Shard 2

           Before
           [“a”, “f”)    [“f”, “n”)    [“n”, “t”)    [“t”, “{”)
            Shard 1       Shard 2       Shard 3       Shard 4

           After
           [“a”, “c”)    [“c”, “n”)    [“n”, “t”)    [“t”, “{”)
            Shard 1       Shard 2       Shard 3       Shard 4

Distributing Data – One range per shard

      Data has to be moved across the cluster

                                        Shard 1   Shard 2   Shard 3   Shard 4
           Start                        500 GB    500 GB    300 GB    300 GB
           Move 100 GB, Shard 1 -> 2    400 GB    600 GB    300 GB    300 GB
           Move 200 GB, Shard 2 -> 3    400 GB    400 GB    500 GB    300 GB
           Move 100 GB, Shard 3 -> 4    400 GB    400 GB    400 GB    400 GB

           Total: 400 GB data movement
Distributing Data – One range per shard

      It’s worse when a new shard is added

                                    Shard 1   Shard 2   Shard 3   Shard 4   Shard 5
           Before                   500 GB    500 GB    500 GB    500 GB    0 GB
           Moved to next shard      100 GB    200 GB    300 GB    400 GB
           After                    400 GB    400 GB    400 GB    400 GB    400 GB

           Total: 1 TB data movement
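
The totals on these two slides follow from a simple running sum. The sketch below is a worked example, not MongoDB code: `cascade_moves` is a hypothetical helper that assumes data can only be handed to the right-hand neighbour, as in the one-range-per-shard layout, and that the total divides evenly.

```python
def cascade_moves(sizes_gb):
    """GB shipped from each shard to its right-hand neighbour to even out the load."""
    target = sum(sizes_gb) // len(sizes_gb)
    moves, carry = [], 0
    for size in sizes_gb[:-1]:
        carry = size + carry - target   # surplus that has to be pushed onwards
        moves.append(carry)
    return moves

four_shards = cascade_moves([500, 500, 300, 300])
new_shard = cascade_moves([500, 500, 500, 500, 0])
```

The first case ships 100, 200 and 100 GB (400 GB in total); adding an empty fifth shard ships 100, 200, 300 and 400 GB (1 TB in total), because every shard has to pass its neighbour's surplus along as well as its own.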




Distributing Data – Multi range shards

      Each shard can contain multiple ranges. Each
      range of data is called a chunk.

           Before
            Shard 1       Shard 2       Shard 3       Shard 4
            500 GB        500 GB        300 GB        300 GB
           [“a”, “f”)    [“f”, “n”)    [“n”, “t”)    [“t”, “{”)

           Move 100 GB, [“d”, “f”), from Shard 1 to Shard 3
           Move 100 GB, [“j”, “n”), from Shard 2 to Shard 4

           After
            Shard 1       Shard 2       Shard 3       Shard 4
            400 GB        400 GB        400 GB        400 GB
           [“a”, “d”)    [“f”, “j”)    [“n”, “t”);   [“t”, “{”);
                                       [“d”, “f”)    [“j”, “n”)
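
With several chunks per shard, routing a key means finding the chunk whose range contains it. The sketch below uses Python's bisect module over the chunk boundaries from the "after" state of this slide; the `bounds`/`owners` layout and the `shard_for` helper are assumptions for illustration, not how MongoDB stores its routing table.

```python
import bisect

# Sorted chunk lower bounds; chunk i covers [bounds[i], bounds[i + 1]).
bounds = ["a", "d", "f", "j", "n", "t"]
owners = ["shard1", "shard3", "shard2", "shard4", "shard3", "shard4"]

def shard_for(key):
    """Return the shard owning the chunk whose range contains key."""
    i = bisect.bisect_right(bounds, key) - 1
    if i < 0:
        raise KeyError(key)   # key sorts before the first chunk
    return owners[i]

routes = {k: shard_for(k) for k in ["cuong", "dung", "khanh", "quang"]}
```

Because a chunk is just a range plus an owner, moving it to another shard only changes one entry in `owners`; neighbouring ranges stay where they are.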
Sharding a collection

      A key (the shard key) is used for chunk ranges.
      A shard key can be of any type; values of different
      types are ordered as follows
  null < numbers < strings < objects < arrays < binary data <
  ObjectIds < booleans < dates < regular expressions
      MongoDB first creates a (-∞, +∞) chunk for a
      collection
      As we add more data, MongoDB splits
      existing chunks to create new ones
      Every chunk range must be distinct and must not
      overlap with any other chunk range
      Data movement is resource-consuming, so chunks are
      kept small: 200MB by default
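
Chunk splitting can be sketched with sorted lists. This is a toy model under loud assumptions: a chunk is a sorted list of keys, its range is implied by its first key, and `MAX_CHUNK_DOCS` is a made-up limit standing in for the size limit from the slide. MongoDB itself tracks explicit ranges and sizes.

```python
MAX_CHUNK_DOCS = 4   # hypothetical limit, stands in for the chunk size limit

def insert(chunks, key):
    """chunks: list of sorted key lists, in range order. Splits a chunk that grows too big."""
    # Choose the last chunk whose first key is <= key; smaller keys go to the first chunk.
    idx = 0
    for i, chunk in enumerate(chunks):
        if chunk and chunk[0] <= key:
            idx = i
    chunk = chunks[idx]
    chunk.append(key)
    chunk.sort()
    if len(chunk) > MAX_CHUNK_DOCS:
        mid = len(chunk) // 2
        chunks[idx:idx + 1] = [chunk[:mid], chunk[mid:]]   # one chunk becomes two
    return chunks

chunks = [[]]                      # a single chunk covering the whole key space
for key in ["m", "c", "x", "f", "t"]:
    insert(chunks, key)
after_split = [list(c) for c in chunks]

insert(chunks, "a")
insert(chunks, "p")
```

The fifth insert pushes the only chunk over the limit, so it is split in the middle; later inserts are routed to whichever of the two chunks covers their key, and the ranges never overlap.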
Balancing

      MongoDB automatically moves chunks from one
      shard to another in order to
         keep the data evenly distributed and
         minimize the data movement. Chunks are only moved
          when a shard has at least 9 more chunks than the
          least populous shard
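
The threshold rule can be sketched as a loop over chunk counts. This is a simplified model, not MongoDB's balancer: `balance` is a hypothetical function, and the assumption that it moves one chunk at a time until the gap drops below the threshold is made for illustration.

```python
MIGRATION_THRESHOLD = 9   # from the slide: act only when the gap reaches 9 chunks

def balance(chunk_counts):
    """chunk_counts: {shard: number of chunks}. Returns the list of (src, dst) moves."""
    counts = dict(chunk_counts)
    moves = []
    while True:
        src = max(counts, key=counts.get)   # most populous shard
        dst = min(counts, key=counts.get)   # least populous shard
        if counts[src] - counts[dst] < MIGRATION_THRESHOLD:
            return moves
        counts[src] -= 1
        counts[dst] += 1
        moves.append((src, dst))

uneven = balance({"shard1": 20, "shard2": 8})
close_enough = balance({"shard1": 10, "shard2": 5})
```

A gap of 12 chunks triggers two migrations; a gap of 5 triggers none, which avoids shuffling data around for small imbalances.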




Choose a Sharding Key

      Avoid a low-cardinality shard key
         Example: a continent value of “Asia”, “Australia”, “Europe”,
          “North America”, or “South America”
         MongoDB can’t split these chunks any further! The
          chunks will just keep getting bigger and bigger.

      An ascending key does not work as well as we
      might expect.
         Example: using a timestamp as the shard key
         Everything is added to the last chunk

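The hot spot caused by an ascending key can be seen by counting where new keys land. The sketch below is illustrative only: the split points, the key values and the `writes_per_chunk` helper are assumptions, with plain integers standing in for timestamps.

```python
import bisect
from collections import Counter

def writes_per_chunk(split_points, keys):
    """Count how many inserted keys land in each chunk, given sorted split points."""
    hits = Counter(bisect.bisect_right(split_points, k) for k in keys)
    return [hits[i] for i in range(len(split_points) + 1)]

split_points = [250, 500, 750]     # four chunks over the keys seen so far (0..999)
ascending = range(1000, 1100)      # like timestamps: always larger than any earlier key
scattered = [(k * 37) % 1000 for k in range(100)]   # spread over the existing range

hot = writes_per_chunk(split_points, ascending)
even = writes_per_chunk(split_points, scattered)
```

All 100 ascending keys hit the last chunk, so a single shard takes every write, while the scattered keys reach all four chunks.
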
Choose a Sharding Key

      Random shard key
         Wastes the index (no data locality)

      So, we want to choose a shard key with nice
      data locality, but not so local that we end up with
      a hot spot.




When to shard

      In general, you should start with a nonsharded
      setup and convert it to a sharded one if and
      when you need to, for example when you
         Run out of disk space on your current machine.
         Want to write data faster than a single process can
          handle.
         Want to keep a larger proportion of data in memory to
          improve performance.




Thank you!

Weitere ähnliche Inhalte

Ähnlich wie Overview of NoSQL

Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022HostedbyConfluent
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedJazz Yao-Tsung Wang
 
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptSQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptChris Richardson
 
Minnebar 2013 - Scaling with Cassandra
Minnebar 2013 - Scaling with CassandraMinnebar 2013 - Scaling with Cassandra
Minnebar 2013 - Scaling with CassandraJeff Bollinger
 
AWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYC
AWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYCAWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYC
AWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYCAmazon Web Services
 
Object Relational Mapping with LINQ To SQL
Object Relational Mapping with LINQ To SQLObject Relational Mapping with LINQ To SQL
Object Relational Mapping with LINQ To SQLShahriar Hyder
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Dataexponential-inc
 
Dynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 PresentationDynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 PresentationShanley Kane
 
Presentation dell - into the cloud with dell
Presentation   dell - into the cloud with dellPresentation   dell - into the cloud with dell
Presentation dell - into the cloud with dellxKinAnx
 
Databases through out and beyond Big Data hype
Databases through out and beyond Big Data hypeDatabases through out and beyond Big Data hype
Databases through out and beyond Big Data hypeParinaz Ameri
 
NoSQL Database
NoSQL DatabaseNoSQL Database
NoSQL DatabaseSteve Min
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDataStax
 
Datastax - Why Your RDBMS fails at scale
Datastax - Why Your RDBMS fails at scaleDatastax - Why Your RDBMS fails at scale
Datastax - Why Your RDBMS fails at scaleRuth Mills
 
Customer Day 18th May 2012
Customer Day 18th May 2012Customer Day 18th May 2012
Customer Day 18th May 2012ctrlsblog
 
Clearing the air on Cloud Computing
Clearing the air on Cloud ComputingClearing the air on Cloud Computing
Clearing the air on Cloud ComputingKarthik Sankar
 
Melbourne Microservices Meetup: Agenda for a new Architecture
Melbourne Microservices Meetup: Agenda for a new ArchitectureMelbourne Microservices Meetup: Agenda for a new Architecture
Melbourne Microservices Meetup: Agenda for a new ArchitectureSaul Caganoff
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 
Microservice-based software architecture
Microservice-based software architectureMicroservice-based software architecture
Microservice-based software architectureArangoDB Database
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBScyllaDB
 

Ähnlich wie Overview of NoSQL (20)

Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
Many Sources, Many Sinks, One Stream With Joel Eaton | Current 2022
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud Testbed
 
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptSQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
 
Minnebar 2013 - Scaling with Cassandra
Minnebar 2013 - Scaling with CassandraMinnebar 2013 - Scaling with Cassandra
Minnebar 2013 - Scaling with Cassandra
 
AWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYC
AWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYCAWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYC
AWS Partner Presentation - Datapipe - Deploying Hybrid IT, AWS Summit 2012 - NYC
 
Object Relational Mapping with LINQ To SQL
Object Relational Mapping with LINQ To SQLObject Relational Mapping with LINQ To SQL
Object Relational Mapping with LINQ To SQL
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Data
 
Dynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 PresentationDynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 Presentation
 
Presentation dell - into the cloud with dell
Presentation   dell - into the cloud with dellPresentation   dell - into the cloud with dell
Presentation dell - into the cloud with dell
 
Databases through out and beyond Big Data hype
Databases through out and beyond Big Data hypeDatabases through out and beyond Big Data hype
Databases through out and beyond Big Data hype
 
NoSQL Database
NoSQL DatabaseNoSQL Database
NoSQL Database
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
Datastax - Why Your RDBMS fails at scale
Datastax - Why Your RDBMS fails at scaleDatastax - Why Your RDBMS fails at scale
Datastax - Why Your RDBMS fails at scale
 
Customer Day 18th May 2012
Customer Day 18th May 2012Customer Day 18th May 2012
Customer Day 18th May 2012
 
Clearing the air on Cloud Computing
Clearing the air on Cloud ComputingClearing the air on Cloud Computing
Clearing the air on Cloud Computing
 
Melbourne Microservices Meetup: Agenda for a new Architecture
Melbourne Microservices Meetup: Agenda for a new ArchitectureMelbourne Microservices Meetup: Agenda for a new Architecture
Melbourne Microservices Meetup: Agenda for a new Architecture
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
Beyond Relational Databases
Beyond Relational DatabasesBeyond Relational Databases
Beyond Relational Databases
 
Microservice-based software architecture
Microservice-based software architectureMicroservice-based software architecture
Microservice-based software architecture
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 

Overview of NoSQL

  • 1. Introduction to NoSQL, 2011.02, Quang Nguyen
  • 2. Agenda: New Challenges for RDBMS; Introduction to NoSQL; MongoDB Sharding
  • 3. Relational DBMS: In use since 1970. Uses SQL to manipulate data: easy to use and easy to integrate with other systems. Excellent for management applications (accounting, reservations, staff management, etc.)
  • 4. ACID Properties of RDBMS: Databases always satisfy these four properties. Atomic: "all or nothing"; when a transaction is executed, it either succeeds completely or fails completely. Consistent: data moves from one correct state to another correct state. Isolated: two concurrent transactions will not become entangled with each other. Durable: once a transaction has succeeded, the change will not be lost.
  • 5. What is the problem with RDBMS? Schemas aren't designed for sparse data. Normalization creates a lot of tables. Joins can be prohibitively expensive. Most importantly, relational databases are simply not designed to be distributed.
  • 6. An Example of a Distributed DB: A banking system consists of 4 branches in four different cities. Each branch maintains its accounts locally: Account = (account-number, branch, balance). One single site maintains information about branches: Branch = (branch-name, city, assets).
  • 7. An Example of a Distributed DB: A client asks a transaction coordinator to transfer $1000 from A ($3000, at Bank A) to B ($2000, at Bank B). Clients want all-or-nothing transactions: the transfer either happens completely or not at all.
  • 8. An Example of a Distributed DB: Simple solution: the client sends "start" to the transaction coordinator, which tells bank A to apply A=A-1000 and bank B to apply B=B+1000, then answers "done". What can go wrong? A does not have enough money. B's account no longer exists. B has crashed. The coordinator crashes.
  • 9. An Example of a Distributed DB: Two-phase Commit Protocol (2PC): the client sends "start" to the transaction coordinator. The coordinator sends "prepare" to bank A and bank B, which lock the accounts and reply with their votes rA and rB. If rA==yes && rB==yes, outcome = "commit"; else outcome = "abort". The coordinator sends the outcome to both banks and the result to the client. B commits upon receiving "commit". The cost: loss of availability and higher latency!
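
The two-phase commit flow from slide 9 can be sketched in a few lines of Python. This is a minimal single-process illustration, not a real distributed implementation: the `Account` class, its `prepare`/`commit`/`abort` methods, and the `transfer` function are names assumed here for the example, and crashes or network failures (the cases listed on slide 8) are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """One bank's local account (hypothetical participant in 2PC)."""
    balance: int
    staged: int = 0       # change prepared in phase 1, not yet applied
    locked: bool = False  # held between "prepare" and the outcome

    def prepare(self, delta: int) -> bool:
        # Phase 1: vote "yes" only if the change can be applied, then lock.
        if self.locked or self.balance + delta < 0:
            return False
        self.locked, self.staged = True, delta
        return True

    def commit(self) -> None:
        # Phase 2, outcome "commit": apply the staged change, release the lock.
        self.balance += self.staged
        self.staged, self.locked = 0, False

    def abort(self) -> None:
        # Phase 2, outcome "abort": discard the staged change, release the lock.
        self.staged, self.locked = 0, False


def transfer(src: Account, dst: Account, amount: int) -> str:
    """Coordinator: the outcome is "commit" only if both votes are yes."""
    votes = [src.prepare(-amount), dst.prepare(amount)]
    outcome = "commit" if all(votes) else "abort"
    for account, voted_yes in zip((src, dst), votes):
        if outcome == "commit":
            account.commit()
        elif voted_yes:
            # Only participants that voted yes hold a lock to release.
            account.abort()
    return outcome
```

With the balances from slide 7, `transfer(Account(3000), Account(2000), 1000)` commits, while a transfer larger than A's balance aborts and leaves both accounts untouched. The lock held between the two phases is where the loss of availability and the extra latency come from.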
  • 10. Schemas vs. Schema-free: Relational schemas use tables to represent real objects, and the join operation is expensive and difficult to execute in a horizontal scale-out. Table columns: Name, Surname, Home, Mobile, Telephone, Office, Marital-Status. Rows: (Quang, Nguyen, null, 398, null, null, null); (Cuong, Trinh, Nguyen Dinh Chieu, 999, 555, null, null). The same data as schema-free documents: user: {name: quang, surname: Nguyen, mobile: 398}; user: {name: Cuong, surname: Trinh, Home: Nguyen Dinh Chieu, mobile: 999, Telephone: 555}
  • 11. New Trends and Requirements
  • 12. Information amount is growing fast: In 2010, the amount of information created and replicated exceeded one zettabyte (a trillion gigabytes) for the first time. In 2011, it surpassed 1.8 zettabytes.
  • 13. Google: BigTable: Web Indexing; Google Earth; YouTube; Google Books; Google Mail. High Scalability, High Availability.
  • 14. Amazon: DynamoDB: RDBMS doesn't fit the requirements; tens of thousands of servers around the world; 10 million customers. High Reliability, High Availability.
  • 15. Facebook: Cassandra, HBase: High Scalability, High Availability. People: more than 800 million active users; more than 50% of active users log on to Facebook on any given day; the average user has 130 friends. Activity: more than 900 million objects that people interact with (pages, groups, events and community pages); on average, more than 250 million photos are uploaded per day; the messaging system, including chat, wall posts, and email, handles 135+ billion messages per month.
  • 16. Twitter: High Availability
  • 17. CAP Theorem: It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees. Consistency: all nodes see the same data at the same time. Availability: every request receives a response about whether it was successful or failed. Partition Tolerance: the system continues to operate despite arbitrary message loss. You can choose only two. In almost all cases, you would choose availability over consistency.
  • 18. Consistency Level: Strong (Sequential): after an update completes, any subsequent access will return the updated value. Weak (weaker than Sequential): the system does not guarantee that subsequent accesses will return the updated value. Eventual: all updates will propagate to all of the replicas in a distributed system, but this may take some time. Eventually, all replicas will be consistent.
  • 19. What is NoSQL: Stands for Not Only SQL. A class of non-relational data storage systems. They usually do not require a fixed table schema, nor do they use the concept of joins. All NoSQL offerings relax one or more of the ACID properties. NoSQL !=
  • 20. What is NoSQL
  • 21. NoSQL Features: Key/Value stores, or "the big hash table": Amazon S3 (Dynamo), Memcached. Schema-less stores, which come in multiple flavors: document-based (MongoDB, CouchDB), column-based (Cassandra, HBase), graph-based (Neo4j).
  • 22. Key/Value: Advantages: very fast; very scalable; simple model; able to distribute horizontally. Disadvantages: many data structures (objects) can't be easily modeled as key/value pairs.
  • 23. Schema-less: Advantages: the schema-less data model is richer than key/value pairs; eventual consistency; many are distributed; still provide excellent performance and scalability. Disadvantages: no ACID transactions.
  • 24. Memcached
  • 26. Introduction to MongoDB: MongoDB is a document-oriented database. Key -> Document; structured documents; schema-free. Key = quang -> user: {name: quang, surname: Nguyen, mobile: 398}. Key = cuong -> user: {name: Cuong, surname: Trinh, Home: Nguyen Dinh Chieu, mobile: 999, Telephone: 555}
  • 27. Introduction to MongoDB: Query = quang, result count: 1, user: {name: quang, surname: Nguyen, mobile: 398}. Query = cuong, result count: 1, user: {name: Cuong, surname: Trinh, Home: Nguyen Dinh Chieu, mobile: 999, Telephone: 555}
  • 28. Features of MongoDB: Indexing; Stored JavaScript; Aggregation; File Storage; makes scaling out easier (scaling out vs. scaling up); scaling out is done automatically, with data balanced across a cluster.
  • 29. Some applications of MongoDB: Large-scale applications; archiving and event logging; document and content management systems. foursquare uses MongoDB to store venues and user "check-ins" into venues, sharding the data over more than 25 machines on Amazon EC2. Craigslist uses MongoDB to archive billions of records. Disney built a common set of tools and APIs for all games within the Interactive Media Group, using MongoDB as a common object repository to persist state information.
  • 31. Introduction to Cassandra: Column Family: a logical division that associates similar data, e.g., User Column Family, Hotel Column Family. Row-oriented: a row doesn't need to have all the same columns as other rows like it (as it would in a relational model). Schema-free.
  • 32. Introduction to Cassandra: Query = quang, result count: 3: (column=name, value=quang, timestamp=32345632); (column=surname, value=Nguyen, timestamp=12345678); (column=mobile, value=398, timestamp=33592839). Query = cuong, result count: 5: (column=name, value=Cuong, timestamp=33434343); (column=surname, value=Trinh, timestamp=34568258); (column=Home, value=Nguyen Dinh Chieu, timestamp=54542368); (column=mobile, value=999, timestamp=23445486); (column=Telephone, value=555, timestamp=34314642)
  • 33. Features of Cassandra: Distributed and Decentralized: in many other distributed data stores, some nodes need to be set up as masters in order to organize other nodes, which are set up as slaves; in Cassandra every node is identical, so there is no single point of failure. High Availability & Fault Tolerance: you can replace failed nodes in the cluster with no downtime, and you can replicate data to multiple data centers to offer improved local performance and prevent downtime if one data center experiences a catastrophe such as fire or flood. Tunable Consistency: it allows you to easily decide the level of consistency you require, in balance with the level of availability.
  • 34. Features of Cassandra: Elastic Scalability: elastic scalability refers to a special property of horizontal scalability. It means that your cluster can seamlessly scale up and scale back down.
  • 35. Some Applications of Cassandra: Large deployments; lots of writes, statistics, and analysis; geographical distribution. Facebook used Cassandra to power Inbox Search, with over 200 nodes deployed. Twitter announced it is planning to use Cassandra because it can be run on large server clusters and is capable of taking in very large amounts of data at a time. AppScale uses Cassandra as a back-end for Google App Engine applications.
  • 37. Neo4j – Graph Database: Data is stored as a graph/network of nodes and relationships with properties; schema-free. Example nodes: people: {name: quang, surname: Nguyen}; people: {name: Cuong, surname: Trinh, hobbies: uncountable}; people: {name: Thanh, surname: Nguyen}; Company: {name: Saltlux, Vietnam, Area: SearchEngine}; Company: {name: TechMaster, area: IT Education, founded: 2011}; Company: {name: Fami, area: Furniture}. Relationship types shown: KNOWS, WORKS, OWNS.
  • 38. Neo4j – Graph Database: Find all persons who know a friend who knows someone called "Larry Ellison": SELECT ?person WHERE { ?person neo4j:KNOWS ?friend . ?friend neo4j:KNOWS ?foe . ?foe neo4j:name "Larry Ellison" . }
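
The two-hop pattern in the query on slide 38 can be illustrated with a plain adjacency map in Python. This is only a sketch of what the query matches, not Neo4j's API: the `knows` edges, the `names` table, and the presence of a "Larry Ellison" node are hypothetical, since the transcript does not preserve which nodes the KNOWS relationships connect.

```python
# Hypothetical directed KNOWS edges: person id -> ids of the people they know.
knows = {
    "quang": {"cuong"},
    "cuong": {"thanh", "larry"},
    "thanh": {"quang"},
    "larry": set(),
}

# Hypothetical name property of each node.
names = {
    "quang": "Quang Nguyen",
    "cuong": "Cuong Trinh",
    "thanh": "Thanh Nguyen",
    "larry": "Larry Ellison",
}


def knows_friend_of(target_name):
    """Return every person with person -KNOWS-> friend -KNOWS-> foe,
    where the foe node carries the given name."""
    result = set()
    for person, friends in knows.items():
        for friend in friends:
            if any(names[foe] == target_name for foe in knows.get(friend, ())):
                result.add(person)
    return result
```

In this toy graph only `quang` matches, through the path quang -> cuong -> larry. A graph database answers this kind of query by following relationships directly instead of joining tables.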
  • 39. Features of Neo4j: Disk-based. Fully transactional like a real database (ACID is satisfied). Scale-up, massive scalability: Neo4j can handle graphs of several billion nodes/relationships/properties on a single machine. No sharding.
  • 40. Some Applications of Neo4j: Ideal for any application that relies on the relationships between records: social networks, recommendations.
  • 42. Some Considerations: What if you want to store a large volume of data, or access it at a higher rate than a single server can handle? When more servers are added, what are the dependencies between servers? Can your application cope if one server or a subset of servers crashes? What if communication has problems?
  • 43. What is sharding: Sharding is the method MongoDB uses to split a large collection across several servers (called a cluster). MongoDB does almost everything automatically; MongoDB lets your application grow easily, robustly, and naturally: making the cluster "invisible"; making the cluster always available for reads and writes; letting the cluster grow easily.
  • 44. A Shard: A shard is one or more servers in a cluster that are responsible for some subset of the data. A shard can consist of many servers. If there is more than one server in a shard, each server has an identical copy of the subset of the data (in the diagram, every server of the shard holds "abc").
  • 45. Distributing Data – One range per shard: One range per shard: Shard 1 ["a", "f"), Shard 2 ["f", "n"), Shard 3 ["n", "t"), Shard 4 ["t", "{"). Data movement issue: moving ["c", "f") from Shard 1 to Shard 2 changes the layout to Shard 1 ["a", "c"), Shard 2 ["c", "n"), Shard 3 ["n", "t"), Shard 4 ["t", "{").
  • 46. Distributing Data – One range per shard: Data has to be moved across the cluster. Start: 500 GB, 500 GB, 300 GB, 300 GB. Move 100 GB from shard 1 to shard 2: 400 GB, 600 GB, 300 GB, 300 GB. Move 200 GB from shard 2 to shard 3: 400 GB, 400 GB, 500 GB, 300 GB. Move 100 GB from shard 3 to shard 4: 400 GB, 400 GB, 400 GB, 400 GB. Total data movement: 400 GB.
  • 47. Distributing Data – One range per shard: It's worse when a new shard is added. Start: 500 GB, 500 GB, 500 GB, 500 GB, 0 GB. Moving 100 GB, 200 GB, 300 GB, and 400 GB between neighboring shards gives 400 GB on each of the five shards. Total data movement: 1 TB.
  • 48. Distributing Data – Multi range shards: Each shard can contain multiple ranges. Each range of data is called a chunk. Start: 500 GB ["a", "f"), 500 GB ["f", "n"), 300 GB ["n", "t"), 300 GB ["t", "{"). Move the 100 GB chunk ["d", "f") from shard 1 to shard 3 and the 100 GB chunk ["j", "n") from shard 2 to shard 4. Result: 400 GB ["a", "d"); 400 GB ["f", "j"); 400 GB ["n", "t") and ["d", "f"); 400 GB ["t", "{") and ["j", "n").
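
Routing a key to its shard under the multi-range layout of slide 48 amounts to a lookup in a sorted list of chunk lower bounds. The sketch below uses the final layout from that slide; the `shard1`..`shard4` labels and the function name `shard_for` are assumptions for the example, and this is not MongoDB's actual routing code.

```python
from bisect import bisect_right

# (lower bound, shard) for each chunk, sorted by lower bound.
# Each chunk ends where the next one begins; ranges are half-open [low, high).
CHUNKS = [
    ("a", "shard1"),  # ["a", "d")
    ("d", "shard3"),  # ["d", "f")
    ("f", "shard2"),  # ["f", "j")
    ("j", "shard4"),  # ["j", "n")
    ("n", "shard3"),  # ["n", "t")
    ("t", "shard4"),  # ["t", "{")
]
UPPER = "{"  # exclusive upper bound of the last chunk
LOWER_BOUNDS = [low for low, _ in CHUNKS]


def shard_for(key: str) -> str:
    """Return the shard holding the chunk whose range contains key."""
    if not (LOWER_BOUNDS[0] <= key < UPPER):
        raise KeyError(key)
    # bisect_right finds the first lower bound greater than key;
    # the chunk just before it is the one that contains key.
    return CHUNKS[bisect_right(LOWER_BOUNDS, key) - 1][1]
```

For instance, `shard_for("cuong")` falls in ["a", "d") on shard1 and `shard_for("quang")` falls in ["n", "t") on shard3. Because a shard may own several non-adjacent chunks, rebalancing only needs to reassign a chunk's owner in this table and move that one chunk.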
  • 49. Sharding a collection: The key (shard key) is used for chunk ranges. A shard key can be of any type; types are ordered as null < numbers < strings < objects < arrays < binary data < ObjectIds < boolean < dates < regular expression. MongoDB first creates a (-∞, +∞) chunk for a collection. If we add more data, MongoDB splits existing chunks to create new ones. Every chunk range must be distinct and must not overlap with any other chunk range. Because data movement is resource-consuming, a chunk is only 200 MB by default.
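
The splitting behavior described on slide 49 can be sketched as follows. This is a simplified model under stated assumptions: the `Chunk` class and `insert` function are hypothetical, the split threshold `MAX_KEYS` counts keys instead of megabytes, and the split point is simply the median key, which is not necessarily how MongoDB picks it.

```python
import math
from bisect import insort

MAX_KEYS = 4  # hypothetical split threshold (the slide speaks of chunk size in MB)


class Chunk:
    """A half-open key range [low, high) and the keys stored in it."""

    def __init__(self, low, high):
        self.low, self.high, self.keys = low, high, []


def insert(chunks, key):
    """Insert key into the chunk covering it; split at the median if too big."""
    i = next(n for n, c in enumerate(chunks) if c.low <= key < c.high)
    chunk = chunks[i]
    insort(chunk.keys, key)
    if len(chunk.keys) > MAX_KEYS:
        mid = chunk.keys[len(chunk.keys) // 2]
        right = Chunk(mid, chunk.high)
        right.keys = [k for k in chunk.keys if k >= mid]
        chunk.keys = [k for k in chunk.keys if k < mid]
        chunk.high = mid  # the two ranges meet at mid and never overlap
        chunks.insert(i + 1, right)


# A collection starts as one chunk covering (-inf, +inf).
chunks = [Chunk(-math.inf, math.inf)]
for k in [5, 1, 9, 3, 7, 2, 8]:
    insert(chunks, k)
```

After these inserts the single starting chunk has been split once at key 5, leaving two adjacent, non-overlapping ranges.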
  • 50. Balancing: MongoDB automatically moves chunks from one shard to another in order to keep the data evenly distributed and to minimize the data movement. A shard must have at least 9 more chunks than the least populous shard before chunks are moved off it.
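
A rough Python sketch of the balancing rule on slide 50: chunks move only while the fullest shard is at least 9 chunks ahead of the emptiest one. The function name `balance` and the stopping rule (stop as soon as the gap drops below the threshold) are assumptions for illustration; the slides do not say when the real balancer stops.

```python
THRESHOLD = 9  # chunk-count gap from slide 50 that triggers balancing


def balance(chunk_counts: dict) -> int:
    """Move one chunk at a time from the fullest to the emptiest shard.

    chunk_counts maps shard name -> number of chunks and is updated in place.
    Returns the number of chunks moved.
    """
    moved = 0
    while True:
        fullest = max(chunk_counts, key=chunk_counts.get)
        emptiest = min(chunk_counts, key=chunk_counts.get)
        if chunk_counts[fullest] - chunk_counts[emptiest] < THRESHOLD:
            return moved
        chunk_counts[fullest] -= 1
        chunk_counts[emptiest] += 1
        moved += 1
```

Requiring a sizeable gap before anything moves is what keeps data movement low: small imbalances are tolerated instead of triggering constant chunk migrations.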
  • 51. Choose a Sharding Key: Avoid a low-cardinality sharding key, such as a continent value: "Asia", "Australia", "Europe", "North America", or "South America". MongoDB can't split these chunks any further! The chunks will just keep getting bigger and bigger. An ascending key does not work as well as we might expect: if a timestamp is used as the sharding key, everything is added to the last chunk.
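
Both problems from slide 51 can be seen in a few lines of Python. The chunk boundaries, the sample timestamps, and the helper `writes_per_chunk` are made up for the illustration.

```python
from bisect import bisect_right
from collections import Counter


def writes_per_chunk(lower_bounds, keys):
    """Count how many keys land in each chunk; chunk i starts at lower_bounds[i]."""
    return Counter(bisect_right(lower_bounds, k) - 1 for k in keys)


# Ascending key: chunks were split on timestamps seen so far,
# and every new timestamp is larger than all of them.
bounds = [0, 100, 200, 300]
new_timestamps = range(301, 401)
ascending = writes_per_chunk(bounds, new_timestamps)

# Low cardinality: the number of distinct key values caps the number of
# chunks that can ever be split apart, no matter how many documents exist.
continents = ["Asia", "Europe", "Asia", "Asia", "Europe", "Asia"]
max_useful_chunks = len(set(continents))
```

All 100 new writes hit the last chunk (index 3), so one shard takes the entire write load, and the six continent-keyed documents can never be spread over more than two chunks.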
  • 52. Choose a Sharding Key: A random shard key wastes the index. So, we want to choose a shard key with nice data locality, but not so local that we end up with a hot spot.
  • 53. When to shard: In general, you should start with a non-sharded setup and convert it to a sharded one if and when you need to, that is, when you: run out of disk space on your current machine; want to write data faster than a single process can handle; want to keep a larger proportion of data in memory to improve performance.

Editor's notes

  1. Tell the story of RDBMS: why RDBMS is popular, what the problem of RDBMS is, and why the features of NoSQL are needed.
  2. MongoDB is very good at real-time inserts, updates, and queries. It provides scalability and replication, which are necessary functions for a large web site's real-time data store.
  3. Scalability is an architectural feature of a system that can continue serving a greater number of requests with little degradation in performance. Vertical scaling (simply adding more hardware capacity and memory to your existing machine) is the easiest way to achieve this. Horizontal scaling means adding more machines that have all or some of the data on them so that no one machine has to bear the entire burden of serving requests. But then the software itself must have an internal mechanism for keeping its data in sync with the other nodes in the cluster.