Key-value Stores
 Bigtable & Dynamo
     Iraklis Psaroudakis



Data Management in the Cloud
     EPFL, March 5th 2012

Outline
•   Motivation & NOSQL
•   Bigtable
•   Dynamo
•   Conclusions
•   Discussion




Motivation
• Real-time web requires:
  – Processing of large amounts of data while maintaining
    efficient performance
  – Distributed systems and replicated data (to globally
    distributed data centers)
     • Commodity hardware
  – High availability and tolerance to network partitioning
  – Requirements varying between
    high throughput & low latency
  – Simple storage access API
More database approaches
RDBMS
• SQL
• ACID
• Strict table schemas

  Example relational table:
  Col1   Col2   Col3
  5      Val1   0101011
  6      Val2   1101001

Key-value stores (e.g. Bigtable, Dynamo)
• Less flexible API
• BASE
• Schema-less

  Example key-value store:
  Key    Value
  key1   0101011
  key2   val_10
NOSQL
• Next Generation Databases mostly addressing some of
  the points: being non-relational, distributed, open-
  source and horizontally scalable. The original intention
  has been modern web-scale databases. The
  movement began early 2009 and is growing rapidly.
  Often more characteristics apply such as: schema-free,
  easy replication support, simple API, eventually
  consistent / BASE (not ACID), a huge amount of
  data and more.
   – from http://nosql-database.org/
   – Column stores (Bigtable), document (e.g. XML) stores, key-
     value stores (Dynamo), Graph & Object DBs etc.


Bigtable
• Distributed storage system
  for structured data
• High scalability
• Fault tolerant
• Meets throughput-oriented but also
  latency-sensitive workloads
• Load balancing
• Strong consistency (?)
• Dynamic control over data layout
   – Does not support full relational model
   – Can serve from disk and from memory

Data model
• “Sparse, distributed, persistent, multidim. sorted map”
• (row:string, column:string, time:int64) → string
• Column key = family:qualifier
   – Few families per table, but unlimited qualifiers per family
   – Data in a family is usually of the same type (for compression)
   – Access control lists are defined on a family-level




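To make the sorted, sparse map concrete, here is a minimal Python sketch of the data model (a toy in-memory stand-in with assumed names, not Bigtable's implementation):

from collections import defaultdict

class SparseTable:
    """Toy stand-in: row -> column ("family:qualifier") -> {timestamp: value}."""
    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def set(self, row, column, timestamp, value):
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.rows[row][column]
        if timestamp is None:
            return versions[max(versions)]          # newest version
        # newest version at or before the requested timestamp
        return versions[max(t for t in versions if t <= timestamp)]

t = SparseTable()
t.set("com.cnn.www", "anchor:www.c-span.org", 1, "CNN")
print(t.get("com.cnn.www", "anchor:www.c-span.org"))   # -> CNN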
API
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Writing data (atomic mutations):
// write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

// Reading data: scan all anchors of a row, all versions
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
Physical Storage & Refinements
• SSTable file format: a persistent, ordered, immutable map from keys to
  values, stored in GFS. Data is split into 64K blocks; a block index and
  a Bloom filter are kept alongside the blocks.
• Example SSTable for a locality group containing the "user" and "social"
  families:

  rowkey  family  qualifier  ts    value
  u1      user    email      2012  rb@example.com
  u1      user    email      2007  ra@example.com
  u1      user    name       2007  Ricky
  u1      social  friend     2007  u2
  u2      user    name       2007  Sam
  u2      user    telephone  2007  0780000000
  u2      social  friend     2007  u1

• A locality group supports:
  – Being loaded into memory
  – A user-specified compression algorithm
Splitting into tablets



Tablets are 100-200 MB in size by default

[Discussion] Column indices?
Componentization of Tables
[Figure: a Table is split into Tablets; e.g. one tablet spans rows
aardvark..apple and the next apples..boat. Each tablet has its own log
and references a set of SSTables, which may be shared between tablets.]

[Discussion] Why are SSTables shared?
Bigtable System Architecture
[Figure: system architecture]
• Master server: performs metadata operations and load balancing
• Tablet servers: serve data from their assigned tablets
• Client & client library: locate and talk to tablet servers
• GFS: stores logs, SSTables and their replicas
• Chubby: lock service; holds metadata and handles master election
• Cluster scheduling system: handles failover and monitoring
Chubby
• Distributed lock service
• File System {directory/file}, for locking
• High availability
   – 5 replicas, one elected as master
   – Service live when majority is live
   – Uses Paxos algorithm to solve consensus
• A client leases a session with the service
• Bigtable keeps in Chubby: the master lock, the servers directory
  (one entry per live tablet server), the root tablet location,
  access control lists, and the Bigtable schema
Finding a tablet




METADATA table (maps (table_key, end_row_key) to the serving tablet
server; a lookup sketch follows below):

  table_key    end_row_key   tablet_server
  UserTable1   aardvark      server1.google.com
  UserTable1   boat          server1.google.com
  UserTable1   ear           server9.google.com
  UserTable2   acre          server3.google.com
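A hedged sketch of the lookup itself: since METADATA rows are keyed by (table_key, end_row_key), the tablet serving a row is the first entry whose end_row_key sorts at or after the row (table contents mirror the slide; server names are illustrative):

import bisect

METADATA = [  # sorted by (table_key, end_row_key)
    ("UserTable1", "aardvark", "server1.google.com"),
    ("UserTable1", "boat",     "server1.google.com"),
    ("UserTable1", "ear",      "server9.google.com"),
    ("UserTable2", "acre",     "server3.google.com"),
]

def find_tablet_server(table, row):
    keys = [(t, end) for t, end, _ in METADATA]
    i = bisect.bisect_left(keys, (table, row))
    if i < len(METADATA) and METADATA[i][0] == table:
        return METADATA[i][2]
    return None  # row is beyond the table's last tablet

print(find_tablet_server("UserTable1", "cat"))  # -> server9.google.com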
Master operations
• Uses Chubby to monitor the health of tablet servers
   – Each tablet server holds a lock in the servers directory of Chubby
   – The server needs to renew its lease periodically with Chubby
   – If a server has lost its lock, the master reassigns its tablets to
     another server. Data and log are still in GFS.
• When (new) master starts
   –   Grabs master lock in Chubby
   –   Finds live tablet servers by scanning Chubby
   –   Communicates with them to find assigned tablets
   –   Scans the METADATA table in order to find unassigned tablets
• Handles table creation and merging of tablets
   – Tablet servers directly update METADATA on tablet split and then
     notify the master. Any failure is detected lazily by the master.

Tablet Serving




• Commit log stores the writes
    – Recent writes are stored in the memtable
    – Older writes are stored in SSTables
• A read operation sees a merged view of the memtable and
  the SSTables (see the sketch below)
• Authorization is checked against the ACL stored in Chubby
[Discussion] What operations are in the log?
             Why did they choose not to write immediately to SSTables?
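A minimal sketch of the merged read view (the dict representation and the DELETED sentinel are assumptions for illustration, not Bigtable code):

DELETED = object()  # sentinel for a deletion record

def read(key, memtable, sstables):
    # consult the memtable first, then SSTables newest-first;
    # the first hit wins, and a deletion marker hides older versions
    for store in [memtable] + sstables:
        if key in store:
            value = store[key]
            return None if value is DELETED else value
    return None

memtable = {"k1": "new"}
sstables = [{"k2": DELETED}, {"k1": "old", "k2": "older"}]
print(read("k1", memtable, sstables))  # -> "new"
print(read("k2", memtable, sstables))  # -> None (deleted)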
Compactions
• Minor compaction – convert the memtable into
  an SSTable
  – Reduce memory usage
  – Reduce log traffic on restart
• Merging compaction
  – Reduce number of SSTables
• Major compaction
  – Merging compaction that results in only one SSTable
  – No deletion records, only live data (garbage collection)
  – Good place to apply policy “keep only N versions”


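To illustrate, a hedged sketch of a major compaction under the same toy representation as the read sketch above: merge the SSTables newest-first, keep only the newest value per key, and drop deletion records entirely:

DELETED = object()  # sentinel for a deletion record, as before

def major_compaction(sstables):  # sstables ordered newest-first
    merged = {}
    for store in sstables:
        for key, value in store.items():
            merged.setdefault(key, value)   # first (newest) value wins
    # garbage-collect: the single resulting SSTable has no deletion records
    return {k: v for k, v in merged.items() if v is not DELETED}

print(major_compaction([{"k2": DELETED}, {"k1": "old", "k2": "older"}]))
# -> {'k1': 'old'}: one SSTable, only live data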
Refinements
• Locality groups – group column families
• Compression – per SSTable block
• Caches
  – Scan cache: key-value pairs
  – Block cache: SSTable blocks
• Bloom filters – minimize disk I/O accesses (see the sketch below)
• Exploiting SSTable immutability
  – No synchronization needed when reading from the file system
  – Efficient concurrency control per row
  – Straightforward garbage collection of obsolete SSTables
  – Enables quick tablet splits (SSTables can be shared)
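A minimal Bloom filter sketch (the parameters m and k are illustrative, not Bigtable's): a membership test can answer "definitely absent", so most disk reads for nonexistent row/column pairs are skipped:

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # derive k bit positions from salted hashes of the item
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # False means definitely absent; True means "maybe present"
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("u1/user:email")
print(bf.might_contain("u1/user:email"))   # True
print(bf.might_contain("u9/user:email"))   # almost certainly False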
Log handling
• One commit log per tablet server
  – Co-mingles all of a server's tablets' logs into a single
    commit log for the server
     • Minimizes arbitrary I/O accesses
     • Optimizes group commits
  – Complicates tablet movement
• GFS delay spikes can stall log writes
  – Solution: two separate logs, one active at a time
  – Can have duplicates between the two, which are
    resolved with sequence numbers
Benchmarks
[Figures: number of 1000-byte values read/written per second, shown per
tablet server and as aggregate values across the cluster]
Dynamo
• Highly-available key-value storage system
   – Targeted for primary key accesses
     and small values (< 1MB).
• Scalable and decentralized
• Gives tight control over tradeoffs between:
   – Availability, consistency, performance.
• Data partitioned using consistent hashing
• Consistency facilitated by object versioning
   – Quorum-like technique for replicas consistency
   – Decentralized replica synchronization protocol
   – Eventual consistency

Dynamo (2)
• Gossip protocol for:
    – Failure detection
    – Membership protocol
• Service Level Agreements (SLAs)
    – Include client’s expected request rate distribution and
      expected service latency.
    – e.g.: Response time < 300ms, for 99.9% of requests,
      for a peak load of 500 requests/sec.
•   Trusted network, no authentication
•   Incremental scalability
•   Symmetry
•   Heterogeneity, Load distribution
Related Work
•   P2P Systems (Unstructured and Structured)
•   Distributed File Systems
•   Databases
•   However, Dynamo aims to:
    – be always writeable
    – support only key-value API (no hierarchical
      namespaces or relational schema)
    – achieve efficient latencies (no multi-hop routing)
Data partitioning
• get(key) : (context, value)
• put(key, context, value)
• MD5 hash on key (yields 128-bit identifier)


[Figure: ring of hashed keys (2^128 positions); each physical node is
assigned a "token", i.e. a position on the ring, and is responsible for
the range of hashed keys between its predecessor's token and its own
(e.g. node C is responsible for the range preceding it)]
• New nodes are randomly assigned tokens
• All nodes have full membership information
(A consistent-hashing sketch follows below.)
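A hedged sketch of consistent hashing with virtual nodes (token counts and node names are illustrative): each physical node owns several random tokens, and a key is stored on the node owning the first token clockwise from the key's MD5 position:

import bisect, hashlib, random

def md5_int(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)  # 128-bit position

class Ring:
    def __init__(self, nodes, tokens_per_node=8):
        # several random tokens per physical node = "virtual nodes"
        self.ring = sorted(
            (random.randrange(2**128), node)
            for node in nodes for _ in range(tokens_per_node)
        )
        self.tokens = [t for t, _ in self.ring]

    def coordinator(self, key):
        # first token clockwise from the key's hash (wrapping around)
        i = bisect.bisect(self.tokens, md5_int(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["A", "B", "C"])
print(ring.coordinator("user:42"))  # one of A, B, C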
Virtual nodes
• Random assignment leads to:
  – Non-uniform data
  – Uneven load distribution
• Solution: “virtual nodes”
  – A single node (physical machine) is assigned
    multiple random positions (“tokens”) on the ring.
  – On failure of a node, the load is evenly dispersed
  – On joining, a node accepts an equivalent load
  – Number of virtual nodes assigned to a physical
    node can be decided based on its capacity.

Replication
• N = number of replicated machines
  – Coordinator node replicates the data item to the
    next N nodes (machines) in the ring.
  – Preference list: list of nodes responsible for storing
    a key, with the coordinator on top (see the sketch below)
• Data versioning
  – Eventual consistency: Updates are propagated to
    all replicas asynchronously
  – Multiple versions of an object can be present
  – Timestamp-based or business-logic reconciliation
    [Discussion] How is Bigtable’s versioning different than Dynamo’s?
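Building on the Ring sketch from the data-partitioning slide above (still an illustration, not Amazon's code), a preference list walks clockwise from the key's position and collects the next N distinct physical nodes, skipping virtual-node repeats:

import bisect

def preference_list(ring, key, n=3):
    start = bisect.bisect(ring.tokens, md5_int(key))
    nodes = []
    for j in range(len(ring.ring)):
        node = ring.ring[(start + j) % len(ring.ring)][1]
        if node not in nodes:          # skip repeated virtual nodes
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes  # coordinator first, then the replicas

print(preference_list(ring, "user:42"))  # e.g. ['B', 'C', 'A']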
Reconciliation of conflicting versions
• Vector clock: a list of
  (node, counter) pairs
  – Captures causality between
    different versions
  – One vector clock per
    version of each value
• Syntactic reconciliation:
  performed by system
• Semantic reconciliation:
  performed by client

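A minimal vector clock sketch (the {node: counter} representation is assumed): a version descends from another if its clock dominates on every entry; otherwise the versions are concurrent and the values must be reconciled:

def descends(a, b):
    """True if clock a includes all events of clock b."""
    return all(a.get(node, 0) >= c for node, c in b.items())

def reconcile(a, b):
    if descends(a, b): return a        # syntactic reconciliation
    if descends(b, a): return b
    # concurrent: merge the clocks; the *values* must be reconciled
    # semantically by the client (e.g. merging two shopping carts)
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

v1 = {"Sx": 2}
v2 = {"Sx": 1, "Sy": 1}
print(descends(v2, v1))   # False: neither descends, so concurrent
print(reconcile(v1, v2))  # {'Sx': 2, 'Sy': 1}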
Execution
• Any node can receive a get() or put()
• Client can select a node for the request using:
  – A generic load balancer (max one hop)
  – A partition-aware client library (zero hop)
• Coordinator for a request:
  – One of the top N healthy nodes in the preference
    list of a key, typically the first.



Consistency
• Quorum-like consistency among replicas.
  Three configurable parameters:
  – N: number of replicas
  – R: min. no. of nodes that must participate in a successful read
  – W: min. no. of nodes that must participate in a successful write
• Full-quorum systems: R+W > N
• Dynamo sacrifices consistency
  – Usually R < N and W < N
  – Typical values: N=3, R=2, W=2

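A small worked check of the quorum condition (not Dynamo code): with R + W > N, every read set intersects every write set, so a read always touches at least one replica holding the latest write:

from itertools import combinations

N, R, W = 3, 2, 2                       # typical Dynamo values, R + W > N
replicas = range(N)
assert all(set(r) & set(w)              # every pair of sets overlaps
           for r in combinations(replicas, R)
           for w in combinations(replicas, W))
print("every R-set overlaps every W-set when R + W > N")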
Handling failures
• Temporary failures:
  – “Sloppy quorum” sends read/write operations to
    N healthy machines on the ring.
     • If a node is down, its replica is sent to an extra machine
       ("hinted handoff"), which later sends the updates back to
       the recovered machine.
• Permanent failures:
  – Synchronizing replicas periodically (anti-entropy)
     • Each node maintains a separate Merkle tree for each
       key range it is assigned.


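A hedged sketch of the Merkle-tree comparison used for anti-entropy (tree construction details are assumed): if two replicas' root hashes for a key range match, the range is in sync and no data needs to be transferred; otherwise the comparison recurses into the differing subtrees:

import hashlib

def h(data):
    return hashlib.md5(data.encode()).hexdigest()

def merkle_root(leaves):
    # hash each key-value pair, then pair up hashes bottom-up to the root
    level = [h(f"{k}={v}") for k, v in sorted(leaves.items())]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last hash if odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

a = {"k1": "v1", "k2": "v2"}
b = {"k1": "v1", "k2": "v2-stale"}
print(merkle_root(a) == merkle_root(dict(a)))  # True: ranges in sync
print(merkle_root(a) == merkle_root(b))        # False: must synchronize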
Membership and Failure detection
• Explicit addition/removal of a node
  – Gossip protocol propagates membership changes
  – Also propagates partitioning information
• Local notion of failure detection
  – If node A does not receive reply from node B, it
    considers B failed.
  – Periodically retries communication.
  – No need for a globally consistent view of failures.


Implementation
A storage node consists of three components:
• Request coordination
  – Built on top of an event-driven messaging substrate (uses Java NIO)
  – The coordinator executes client read & write requests
  – State machines are created on the nodes serving requests; each state
    machine instance handles exactly one client request and contains the
    entire process and failure-handling logic
• Membership & failure detection
• Local persistence engine
  – Pluggable storage engines: Berkeley Database (BDB) Transactional
    Data Store, BDB Java Edition, MySQL, or an in-memory buffer with a
    persistent backing store
Partitioning schemes




• Strategy 1: T random tokens per node; partition by token value
• Strategy 2: T random tokens per node; Q equal-sized partitions (Q is large)
• Strategy 3: Q/S tokens per node; equal-sized partitions (S machines)

In terms of load distribution efficiency,
Strategy 3 > Strategy 1 > Strategy 2
Experiments: Latencies

[Figure: average and 99.9th percentile latencies over time
(1 interval = 12h); the 99.9th percentile stays within the 300ms SLA but
is one order of magnitude higher than the average]
Experiments: Buffered writes




[Figure: latencies with buffered writes over a 24h period
(1 interval = 1h)]
Experiments: Divergent Versions
• Divergent versions arise when:
  – The system is facing failure scenarios
  – Large number of concurrent write requests for a
    value. Multiple nodes end up coordinating.
• Experiment during a 24h period showed:
  – 99.94% requests saw one version
  – 0.00057% saw 2
  – 0.00047% saw 3
  – 0.00009% saw 4

Comparison
                    Dynamo                 Bigtable
Data Model          key-value, row store   column store
API                 single tuple           single tuple and range
Data Partition      random                 ordered
Optimized for       writes                 writes
Consistency         eventual               atomic
Multiple Versions   version                timestamp
Replication         quorum                 file system
Data Center Aware   yes                    yes *
Persistency         local and pluggable    replicated and distributed
                                           file system
Architecture        decentralized          hierarchical
Client Library      yes                    yes

From the Cassandra website:
Bigtable: "How can we build a distributed DB on top of GFS?"
Dynamo: "How can we build a distributed hash table appropriate for the
data center?"
Conclusions
• Bigtable and Dynamo offer two very different
  approaches for distributed data stores.
  – Bigtable is more tuned for scalability, consistency
    and Google’s specific needs.
  – Dynamo is more tuned for various SLAs, sacrificing
    consistency and achieving high availability. Scales
    only to a couple hundred machines.
• Dynamo proves to be more flexible thanks to its N,
  R, W parameters. It also builds on academic
  research (P2P, quorums, Merkle trees, etc.).
Open-source implementations
• Dynamo – Kai
• Bigtable – HBase (based on Hadoop)
• Apache Cassandra
  – Brings together Dynamo’s fully distributed design
    and Bigtable’s data model




Thank you!
                         iraklis.psaroudakis@epfl.ch


Parts taken from:
•    Designs, Lessons and Advice from Building Large Distributed Systems – Jeff Dean (Google Fellow)
•    The Anatomy of the Google Architecture (Ed Austin)
•    Dynamo: Amazon’s Highly Available Key-value Store (Jagrut Sharma)
•    Ricardo Vilaca, Francisco Cruz, and Rui Oliveira. 2010. On the expressiveness and trade-offs of large scale
     tuple stores. In Proceedings of the 2010 international conference on On the move to meaningful internet
     systems: Part II (OTM'10)
Papers on Bigtable and Dynamo:
•    Fay Chang, Jeffrey Dean, et al. 2008. Bigtable: A Distributed Storage System for Structured Data.
•    Giuseppe DeCandia, et al. 2007. Dynamo: amazon's highly available key-value store. In Proceedings of
     twenty-first ACM SIGOPS symposium on Operating systems principles (SOSP '07).


CAP theorem
• Eric Brewer: You can have at most two of (Consistency,
  Availability, tolerance to network Partitions) for any shared-
  data system
   – CA: ACID DBs & Bigtable?, AP: Dynamo
   – Is Bigtable really strongly consistent, is GFS?
• Daniel Abadi: PACELC
   – “If there is a partition (P) how does the system tradeoff between
     Availability and Consistency; else (E) when the system is running as
     normal in the absence of partitions, how does the system tradeoff
     between Latency (L) and Consistency (C).”
   – PC/EC: ACID DBs & Bigtable?, PA/EL: Dynamo
• Interesting article in IEEE: “Overcoming CAP with Consistent
  Soft-State Replication”

Discussion topics
• What kind of services need RDBMS?
   – Specifically, what would you not implement with NOSQL?
   – Which services need transactions and strong consistency?
• Centralized / hierarchical approaches or decentralized?
   – Bigtable’s membership information is kept in METADATA table
   – Dynamo’s membership information is gossiped
• How would you scale Dynamo to large numbers of machines?
• Who answers read/write requests faster, Dynamo or Bigtable?
• Has anyone used another NOSQL system, e.g. Cassandra?


