Google Bigtable

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach,
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google, Inc.

UWCS OS Seminar Discussion
Erik Paulson
2 October 2006

See also the (other) UW presentation by Jeff Dean in September of 2005
(See the link on the seminar page, or just google for “google bigtable”)

Before we begin…
• Intersection of databases and distributed systems
• Will try to explain (or at least warn) when we hit a patch of database
• Remember this is a discussion!

Google Scale
• Lots of data
  – Copies of the web, satellite data, user data, email and USENET, Subversion backing store
• Many incoming requests
• No commercial system big enough
  – Couldn’t afford it if there was one
  – Might not have made appropriate design choices
• Firm believers in the End-to-End argument
• 450,000 machines (NYTimes estimate, June 14th, 2006)

Building Blocks
• Scheduler (Google WorkQueue)
• Google Filesystem
• Chubby lock service
• Two other pieces helpful but not required
  – Sawzall
  – MapReduce (despite what the Internet says)
• BigTable: build a more application-friendly storage service using these parts

Google File System
• Large-scale distributed “filesystem”
• Master: responsible for metadata
• Chunk servers: responsible for reading and writing large chunks of data
• Chunks replicated on 3 machines; master responsible for ensuring replicas exist
• OSDI ’04 paper

Chubby
• {lock/file/name} service
• Coarse-grained locks; can store a small amount of data in a lock
• 5 replicas; needs a majority vote to be active
• Also an OSDI ’06 paper

Data model: a big map
• <Row, Column, Timestamp> triple for key; lookup, insert, and delete API (sketched below)
• Arbitrary “columns” on a row-by-row basis
  – Column family:qualifier. Family is heavyweight, qualifier lightweight
  – Column-oriented physical store; rows are sparse!
• Does not support a relational model
  – No table-wide integrity constraints
  – No multirow transactions

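To make the model concrete, here is a minimal sketch of the three-dimensional map in Python. The class and method names are illustrative, not Google's actual API; it only mirrors the key structure and the lookup/insert/delete operations above.

    # Hedged sketch of the data model: a map keyed by (row, column, timestamp).
    # "BigMap" and its method names are hypothetical, for illustration only.
    import time

    class BigMap:
        def __init__(self):
            # row key -> column ("family:qualifier") -> {timestamp: value}
            self.rows = {}

        def insert(self, row, column, value, timestamp=None):
            ts = time.time() if timestamp is None else timestamp
            self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

        def lookup(self, row, column):
            # Return the most recent version for this (row, column).
            versions = self.rows.get(row, {}).get(column, {})
            return versions[max(versions)] if versions else None

        def delete(self, row, column):
            self.rows.get(row, {}).pop(column, None)

    # Example: one web-table-style row with an "anchor" column family.
    m = BigMap()
    m.insert("com.cnn.www", "anchor:cnnsi.com", "CNN")
    print(m.lookup("com.cnn.www", "anchor:cnnsi.com"))  # -> "CNN"
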
SSTable
• Immutable, sorted file of key-value pairs
• Chunks of data plus an index
  – Index is of block ranges, not values (see the sketch below)

                    SSTable
    +-----------+-----------+-----------+-------+
    | 64K block | 64K block | 64K block | Index |
    +-----------+-----------+-----------+-------+

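A rough sketch of the idea in Python, assuming a few pairs per block instead of 64K byte blocks for readability. The index records only each block's first key, so a read binary-searches the index and then scans one block; none of these names come from the paper.

    import bisect

    # Hedged sketch: an immutable, sorted key-value file split into blocks,
    # with an index over block start keys (not over individual values).
    class SSTableSketch:
        def __init__(self, sorted_pairs, block_size=4):
            # Real blocks are ~64KB of bytes; N pairs per block keeps this readable.
            self.blocks = [sorted_pairs[i:i + block_size]
                           for i in range(0, len(sorted_pairs), block_size)]
            self.index = [block[0][0] for block in self.blocks]  # first key per block

        def get(self, key):
            # Binary-search the index for the candidate block, then scan it.
            i = bisect.bisect_right(self.index, key) - 1
            if i < 0:
                return None
            return next((v for k, v in self.blocks[i] if k == key), None)
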
Tablet
• Contains some range of rows of the table
• Built out of multiple SSTables

    Tablet  (Start: aardvark, End: apple)
      SSTable: [64K block][64K block][64K block][Index]
      SSTable: [64K block][64K block][64K block][Index]

Table
• Multiple tablets make up the table
• SSTables can be shared
• Tablets do not overlap, SSTables can overlap (see the routing sketch below)

    Tablet (aardvark … apple)        Tablet (apple_two_E … boat)
        SSTable    SSTable      SSTable      SSTable
                   (a shared SSTable can serve both tablets)

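Because tablet row ranges are disjoint, a row key identifies exactly one tablet. A hedged sketch of that routing, with hypothetical names and the boundary keys from the diagram above (the real lookup goes through the METADATA hierarchy on the next slide):

    import bisect

    # Hedged sketch: route a row key to the unique tablet covering it.
    # Tablets are represented by sorted, non-overlapping (start, end) ranges.
    tablet_starts = ["aardvark", "apple_two_E"]   # tablet 0, tablet 1
    tablet_ends   = ["apple", "boat"]

    def find_tablet(row_key):
        i = bisect.bisect_right(tablet_starts, row_key) - 1
        if i < 0 or row_key > tablet_ends[i]:
            return None  # key falls outside every tablet's range
        return i

    print(find_tablet("abc"))     # -> 0 (first tablet: aardvark … apple)
    print(find_tablet("banana"))  # -> 1 (second tablet: apple_two_E … boat)
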
Finding a tablet
[Figure: three-level tablet location hierarchy: a Chubby file points to the root METADATA tablet, which points to other METADATA tablets, which point to the user tablets]

Servers
• Tablet servers manage tablets, multiple tablets per server. Each tablet is 100-200 megs
  – Each tablet lives at only one server
  – Tablet server splits tablets that get too big
• Master responsible for load balancing and fault tolerance
  – Uses Chubby to monitor the health of tablet servers, restarts failed servers
  – GFS replicates the data. Prefer to start a tablet server on the same machine where the data already is

Editing a table
• Mutations are logged, then applied to an in-memory version (sketched below)
• Logfile stored in GFS

    Insert, Insert, Delete, Insert, Delete, Insert
          |
          v
    Tablet (apple_two_E … boat)
      Memtable            (in memory)
      SSTable   SSTable   (immutable, on disk)

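A minimal sketch of that write path, assuming a redo log plus an in-memory buffer; the class, file path, and record format are illustrative, not from the paper.

    # Hedged sketch of the write path: append every mutation to a GFS-style
    # log first, then apply it to the in-memory memtable.
    class TabletWriter:
        def __init__(self, log_path):
            self.log = open(log_path, "a")  # stands in for a logfile in GFS
            self.memtable = {}              # in-memory view of recent mutations

        def insert(self, key, value):
            self.log.write(f"INSERT\t{key}\t{value}\n")
            self.log.flush()                # durable before acknowledging
            self.memtable[key] = value

        def delete(self, key):
            self.log.write(f"DELETE\t{key}\n")
            self.log.flush()
            self.memtable[key] = None       # deletion marker, resolved at compaction

    w = TabletWriter("/tmp/tablet.log")
    w.insert("apple_two_E", "row data")
    w.delete("boat")
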
Compactions
• Minor compaction – convert the memtable into an SSTable (see the sketch below)
  – Reduce memory usage
  – Reduce log traffic on restart
• Merging compaction
  – Reduce number of SSTables
  – Good place to apply policy “keep only N versions”
• Major compaction
  – Merging compaction that results in only one SSTable
  – No deletion records, only live data

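A hedged sketch of the three compaction kinds, assuming versioned values of the shape {key: [(timestamp, value), ...]} with None as a deletion marker; the function names and data layout are invented for illustration.

    # Hedged sketch of compactions. A "table" is a sorted list of
    # (key, [(timestamp, value), ...]) entries; value None is a tombstone.

    def minor_compaction(memtable):
        """Freeze the memtable into a new immutable, sorted SSTable-like list."""
        return sorted(memtable.items())

    def merging_compaction(sstables, keep_versions=3):
        """Merge several SSTables; keep only the N newest versions per key."""
        merged = {}
        for table in sstables:
            for key, versions in table:
                merged.setdefault(key, []).extend(versions)
        out = []
        for key in sorted(merged):
            newest = sorted(merged[key], reverse=True)[:keep_versions]
            out.append((key, newest))
        return out

    def major_compaction(sstables):
        """Merge to a single SSTable and drop tombstones: only live data remains."""
        result = []
        for key, versions in merging_compaction(sstables, keep_versions=1):
            ts, value = versions[0]
            if value is not None:            # drop deletion markers
                result.append((key, [(ts, value)]))
        return result
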
Locality Groups
• Group column families together into an SSTable
  – Avoid mingling data, e.g. page contents and page metadata
  – Can keep some groups all in memory
• Can compress locality groups
• Bloom filters on locality groups – avoid searching SSTables that cannot contain the key (see the sketch below)

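A small sketch of why the Bloom filter helps: a read consults the filter first and skips SSTables that definitely lack the key. The filter parameters and class name here are arbitrary, not Bigtable's.

    import hashlib

    # Hedged sketch: a toy Bloom filter. False positives cost one extra
    # SSTable read; false negatives never happen, so skipping is safe.
    class BloomSketch:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.bits = [False] * num_bits
            self.num_hashes = num_hashes

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % len(self.bits)

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = True

        def might_contain(self, key):
            return all(self.bits[pos] for pos in self._positions(key))

    # Build one filter per SSTable; consult it before touching the file.
    f = BloomSketch()
    f.add("com.cnn.www")
    print(f.might_contain("com.cnn.www"))   # True
    print(f.might_contain("com.foo.bar"))   # almost certainly False
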
Microbenchmarks
[figure-only slides; benchmark figures omitted]

Application at Google
[figure-only slide omitted]

Lessons learned
• Interesting point: only implement some of the requirements, since the rest are probably not needed
• Many types of failure possible
• Big systems need proper systems-level monitoring
• Value simple designs
