This document provides an overview of Google's Megastore database system. It discusses three key aspects: the data model and schema language for structuring data, transactions for maintaining consistency, and replication across datacenters for high availability. The data model takes a relational approach and uses the concept of entity groups to partition data at a fine-grained level for scalability. Transactions provide ACID semantics within entity groups. Replication uses Paxos consensus for strong consistency across datacenters.
3. Three Aspects of Megastore
• Data model (what makes it a database)
– Data layout
– Indexing
• Transactions and ACID
– Within an entity group
– Across entity groups
• Replication across datacenters (not covered in detail in this presentation)
– Synchronous replication
– Optimized Paxos
2012/3/25 3
4. What is Megastore?
Megastore is:
a database layered over Bigtable,
with high availability across datacenters.
Big-data philosophy:
fine-grained partitioning to keep things simple,
data placement for related data,
and Paxos for replication;
then, a simple API/schema language for convenience of use!
5. Target Applications
• Interactive online services
– User-facing applications
• Conflicting requirements
– Highly scalable (size, throughput)
– Rapid development, fast time-to-market
– Responsive, low latency
– Consistent view of data
– Highly available
• Application developers
– Familiar with RDBMS and SQL
– Find it difficult to give up the "read-modify-write" idiom
– But now need high scalability for big data
• Reads vs. writes
– 20 billion : 3 billion daily @Google
– About 7:1
• Big data
– Petabytes of primary data
– Across datacenters
6. NoSQL + RDBMS = Megastore
• NoSQL datastore (Bigtable)
– Pros
• Highly scalable
• Highly available within a DC (across hosts)
– Cons
• Limited API
• Loose consistency models
• Complicates application development
• RDBMS
– Pros
• Rich feature set for convenient, rapid application development
• Transactions
• ACID semantics
– Cons
• Difficult to scale
• Megastore database: a blend of both
– High scalability
– Distributed transactions
– Consistency guarantees
– Fully serializable ACID semantics within entity groups
– Convenience, rapid development for applications
– + High availability
• Within-DC (Bigtable)
• Across-DC replication via Paxos (synchronous writes within an EG)
– Strong consistency guarantees (synchronous replication)
– Reasonable latency, seamless failover
7. Design Principles
• Taking a middle ground in the RDBMS vs. NoSQL design space:
– partition the datastore and
– replicate each partition separately,
– providing full ACID semantics within partitions,
– but only limited/loose consistency guarantees across them.
• Use Paxos to build a highly available system:
– provides reasonable latencies for interactive applications while
– synchronously replicating writes across geographically distributed
datacenters,
– to achieve across-DC high availability and a consistent view of the data.
• Approaches:
– for database scale, partitioning data into a vast space of small
databases, each with its own replicated log stored in a per-replica
Bigtable;
– for availability, implementing a synchronous, fault-tolerant log replicator
optimized for cross-DC replication.
8. EG: Entity Groups
• The entity-group concept is the cornerstone of scalability and availability!
– Fine-grained partitions of data
– Fine-grained control over data's partitioning and locality
– Like many mini-databases
– To scale throughput and localize outages
– Each independently and synchronously replicated across DCs
• A physical EG in Bigtable consists of
– A write-ahead log (for ACID transactions)
– Related data (pre-joined)
– Local indexes (also with ACID)
– … like a mini-database (locally complete)
– And an inbox for receiving across-EG messages
• Size of an EG
– Not too large, not too small
– An a priori/natural or deliberate grouping of data for fast operations
– If too large: serializable ACID transactions cause long latency and low throughput
– If too small: many expensive across-EG consistency operations (e.g. 2PC), or looser-consistency asynchronous messaging
Side notes: The data for most Internet services can be suitably partitioned (e.g., by user) to make this approach viable. Nearly all applications built on Megastore have found ways to draw EG boundaries.
9. Schematic Diagrams
Figure: Megastore layout in Bigtable. Each EG is like a mini-DB, consisting of a WAL (logs), primary data, local indexes, and an inbox for queue messages; EG 2 … EG n follow the same layout.
10. Many WALs vs. a Single WAL
• Many replicated logs, each governing its own EG, improve availability and throughput.
– Independent, concurrent operations across different EGs
– Only operations within an EG need to be serialized
– Temporarily slow or failed operations do not impact other EGs
• Many WALs scale throughput and localize outages
• Each WAL is stored with its EG in Bigtable
• Examples with the same tenet
– The asynchronous, concurrent RPC communication frameworks of HBase and Hadoop IPC
11. Consistency Levels and the Approaches
• Within each EG: full ACID semantics
– Single-phase-commit ACID transactions
– And each commit is replicated via Paxos across DCs
• Across EGs: limited consistency guarantees (two methods for two levels)
– Two-phase commit (expensive, long latency) -> strong consistency
– Or, typically, efficient asynchronous messaging (queues!, inexpensive, low latency) -> loose (or eventual) consistency
• Two-phase commit vs. asynchronous messaging
– Two-phase-commit transactions
• Strong consistency
• Expensive
• Long latency and low throughput
• Usually for low-traffic operations
– Asynchronous messaging
• Loose consistency; may be temporarily inconsistent (or eventually consistent)
• Inexpensive
• Usually for heavy-traffic operations
• Objects to be made consistent:
– Data and local indexes within an EG: strong (via WAL, ACID)
– Data and global indexes across EGs: strong (via 2PC) or looser (via messaging)
– Replicas within a DC: strong (via GFS and Bigtable)
– Replicas across DCs: strong (via Paxos)
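The asynchronous-messaging path can be sketched as below. This is an illustrative model, not Megastore's API: the class and method names are invented, and it only shows the idea that cross-EG updates are delivered to a per-EG inbox and applied later, so readers of the target EG may briefly see stale data (eventual consistency).

```python
import queue

class EntityGroup:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.inbox = queue.Queue()   # per-EG inbox for across-EG messages

    def send(self, target, message):
        # In Megastore this enqueue happens transactionally with the
        # sender's write; delivery and apply are asynchronous.
        target.inbox.put(message)

    def drain_inbox(self):
        # Applied inside the receiver's own ACID transaction, in order.
        while not self.inbox.empty():
            key, value = self.inbox.get()
            self.data[key] = value

eg_a, eg_b = EntityGroup("A"), EntityGroup("B")
eg_a.send(eg_b, ("follower_count", 42))
# Until eg_b drains its inbox, a reader of eg_b sees stale data.
stale = eg_b.data.get("follower_count")
eg_b.drain_inbox()
fresh = eg_b.data.get("follower_count")
```

The contrast with 2PC is visible here: the sender never blocks on the receiver, at the price of a window where the two EGs disagree.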
12. The Two Faces of ACID Transactions
• The good face:
– Simplifies application development
– Makes reasoning about correctness easier
• The bad face:
– Reduced performance
– Higher latency
– Lower throughput
13. Architecture of Megastore – How is it deployed?
• Deployment consists of
– a client library (the DB logic)
– and auxiliary servers (for across-DC replication)
• Applications link against the client library
14. Data Model and Semantics
to be a database …
15. Principles for Being a DBMS
• Provide traditional database features, such as secondary indexes,
• but only those features that can scale within user-tolerable latency limits,
• and only with the semantics that the EG partitioning scheme can support.
The feature set is carefully chosen, with tradeoffs.
16. Data Model (database concepts)
• A data model is a notation for describing data or information.
• It generally consists of 3 parts
– Structure of the data
– Operations on the data
– Constraints on the data
• Megastore data model: relational model + scale
– Limited relational model
– Bigtable's scalability
• High-level model vs. physical-level model
– Physical level
• Complicates application development
• Bigtable's data model is at the physical level
– High level
• Lets programmers write code conveniently
• Language, SQL
17. Data Model
• Schemaful
– Strongly typed properties (primitives or Protocol Buffers)
– Required, optional, or repeated
– All entities in a table have the same set of allowable properties
– Nested Protocol Buffers?
• Primary key
– Built from a sequence of properties
– Must be unique within the table
• An EG = a root entity + all entities in child tables that reference it
Figure: the hierarchy Schemas (name) -> Tables (name) -> Entities (primary key) -> Properties (name, type). An EG root table (EG key) has child tables (foreign key = EG key); e.g. a User root entity with Photo and Book child-table entities, modeling schema-related hierarchical data.
18. SQL-Like Schema Language (DDL)
CREATE SCHEMA DemoApp;

CREATE TABLE User {
  required int64 userId;
  required string name;
} PRIMARY KEY(userId), ENTITY GROUP ROOT;

CREATE TABLE Photo {
  required int64 userId;
  required int32 photoId;
  required int64 time;
  required string url;
  optional string thumbUrl;
  repeated string tag;
} PRIMARY KEY(userId, photoId),
  IN TABLE User,
  ENTITY GROUP KEY(userId) REFERENCES User;

CREATE LOCAL INDEX PhotosByTime
  ON Photo(userId, time);

CREATE GLOBAL INDEX PhotosByTag
  ON Photo(tag) STORING (thumbUrl);

Additional qualifiers: DESC|ASC|SCATTER, e.g.

CREATE TABLE Book {
  required int64 userId;
  required int32 bookId;
  required int64 time;
  required string url;
  repeated string tag;
} PRIMARY KEY([DESC|ASC|SCATTER] userId, [DESC|ASC|SCATTER] bookId),
  IN TABLE User,
  ENTITY GROUP KEY(userId) REFERENCES User;

CREATE LOCAL INDEX BooksByTime
  ON Book([DESC|ASC|SCATTER] userId, [DESC|ASC] time);
19. Data Placement in Bigtable (principles)
Pre-join with Keys, for performance …
• Lets applications control the placement of hierarchical/related data, to
minimize latency and maximize throughput
– Storing data that is accessed together in nearby rows, or
– Denormalized into the same row
• The data for an EG are held in contiguous ranges of Bigtable rows, for
– Low latency
– High throughput
– Cache efficiency
• Pre-Joining with keys
– Primary keys to cluster entities that will be read together.
– Each entity maps into a single Bigtable row.
– Primary key values are concatenated to form the Bigtable row key
– Each remaining property occupies its own Bigtable column
– Entity-group key as the prefix of Primary key (row key)
– Sorted keys ascending or descending
– SCATTER (two-byte hash prefix), to prevent hotspots in Bigtable
– Recursive for arbitrary join depths (multiple levels of “IN TABLE”)
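The key-construction rules above can be sketched in a few lines. This is an illustrative encoding, not Megastore's actual byte format: primary-key values are concatenated into the Bigtable row key with the entity-group key as prefix, so an EG's rows sort into a contiguous range, and SCATTER prepends a short hash to spread hotspots.

```python
import hashlib

def row_key(*pk_values):
    # Zero-pad integers so lexicographic (Bigtable) order matches numeric order.
    return "/".join(f"{v:020d}" if isinstance(v, int) else str(v)
                    for v in pk_values)

def scatter_key(*pk_values):
    # SCATTER: a short hash prefix of the leading key prevents hotspots,
    # at the cost of breaking range contiguity for that table.
    prefix = hashlib.sha256(str(pk_values[0]).encode()).hexdigest()[:4]
    return prefix + "/" + row_key(*pk_values)

# All of user 7's photos sort immediately after the <7> root row,
# and before the <8> root row, i.e. one contiguous EG range:
keys = sorted([row_key(7), row_key(7, 2), row_key(7, 1), row_key(8)])
```

Multiple levels of "IN TABLE" simply extend this concatenation recursively.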
20. Data Placement in Bigtable (details)
Pre-join with Keys, for performance …
• Bigtable row key = primary key of each table
• Bigtable column name = <table name>.<property name>
– Allowing entities from different Megastore tables to be mapped into the
same Bigtable row without collision.
• Store the transaction and replication log and metadata for the EG in the root entity's Bigtable row.
– Because Bigtable provides per-row transactions.
• Indexes: Each index entry is represented as a single Bigtable row
– Bigtable row key = <indexed property values> + <primary key>
– Bigtable cell columns: denormalized properties
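The index-entry mapping above can be sketched as follows. The encoding is illustrative (an assumed separator, not Megastore's byte format): each index entry is one Bigtable row whose key is the indexed property values followed by the entity's primary key, and STORING copies denormalized properties into the entry's columns.

```python
def index_row(indexed_values, primary_key, storing=None):
    # Row key = <indexed property values> + <primary key>;
    # columns hold any denormalized (STORING) properties.
    key = "/".join(map(str, list(indexed_values) + list(primary_key)))
    columns = dict(storing or {})
    return key, columns

# Global index PhotosByTag ON Photo(tag) STORING (thumbUrl):
entry = index_row(["girl"], ["U1", "P1"], {"Photo.thumbUrl": "TURL1"})
```

Appending the primary key makes every index row key unique even when many entities share the same indexed value.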
21. Data Placement in Bigtable (examples)
Figure: the EG for user U1 laid out in Bigtable. Columns span transaction metadata (Root.WAL, Root.meta), the User table (User.name), the Photo table (Photo.time, Photo.url, Photo.thumbUrl, Photo.tag), and the denormalized global-index column PhotosByTag.thumbUrl (via STORING).
• Root row <U1>: WAL entries Log1..Log3, commit offset and applied offset in Root.meta, User.name = Jack.
• Photo data rows: <U1,P1> (time T1, URL1, TURL1, tags: girl, car) and <U1,P2> (time T2, URL2, TURL2, tags: dress, girl).
• Local index PhotosByTime rows: <U1,T1><U1,P1>, <U1,T2><U1,P2>.
• Global index PhotosByTag rows (with stored thumbUrl): <car><U1,P1> TURL1; <dress><U1,P2> TURL2; <girl><U1,P1> TURL1; <girl><U1,P2> TURL2.
22. Secondary Indexes
• Secondary indexes can be declared on any list of entity properties (optional is OK), including repeated properties, as well as sub-fields within Protocol Buffers, and full-text indexes.
• Local indexes
– Within an EG
– Obey ACID semantics
• The index entries are stored in the entity group and are updated atomically and consistently with the primary entity data.
• Global indexes
– Span EGs
– Looser consistency (possibly eventual)
• Not guaranteed to reflect all recent updates. (May be inconsistent with the primary data?)
• Keeping global indexes consistent with the primary data is tricky!?
– A special two-phase commit? and
– Read-repair?
23. Secondary Indexes and Denormalization
• STORING clause: copy data into index entries
– Avoids the indirect access to primary entities, which is a very expensive random access.
– But keeping the copies consistent is an issue!
• Inline indexes
– Index entries from the source entities appear as a virtual repeated column in the target entity.
– An inline index can be created on any table (child) that has a foreign key referencing another table (parent), by using the first primary-key component of the target entity as the first component of the index.
CREATE INLINE INDEX PhotosByTime ON Photo(userId, time);
Figure: the inline index PhotosByTime stored as repeated columns on the parent (User) row. Parent row <U1> carries User.name = Jack plus PhotosByTime.T1 = <P1> and PhotosByTime.T2 = <P2>; child rows <U1,P1> and <U1,P2> carry Photo.time (T1, T2) and Photo.thumbUrl (TURL1, TURL2).
24. Inline Indexes for Many-to-Many Relationships
• Coupled with repeated properties, inline indexes can also be used to implement many-to-many relationships more efficiently than by maintaining a many-to-many link table.
CREATE INLINE INDEX PhotosByTag ON Photo(userId, tag);
Figure: the inline index PhotosByTag stored as repeated columns (car, dress, girl) on the parent (User) rows. Row <U1> (User.name = Jack) carries PhotosByTag.car = <P1>, PhotosByTag.dress = <P2>, PhotosByTag.girl = <P1>, <P2>; row <U2> (User.name = Tom) carries <P1>. Child rows <U1,P1>, <U1,P2>, <U2,P1> carry Photo.time (T1, T2, T3) and Photo.thumbUrl (TURL1, TURL2, TURL3).
25. API
• Cost-transparent API
– Match application developers’ intuitions
– High-volume interactive workloads benefit more from predictable performance than from an
expressive query language.
• Normalized relational schemas, which rely on joins at query time to service user operations, are not the right model for Megastore applications.
– Pre-joins
– Denormalization
• SQL-Like Schema language (DDL, for data structures and data placement)
– Fine-grained control over physical locality
• Hierarchical layouts (pre-joins)
• Declarative denormalization
– Eliminate the need for most joins
• Queries API against particular tables and indexes
– Range Scans
– Lookups
• Schema changes require corresponding modifications to the query implementation
code
26. Query Joins
• Query joins, when required, are implemented in application code.
• Index-based joins
• Merge joins
– Multiple queries return primary keys for the same table, in the same order.
– Then take the intersection of the key sets.
• Outer joins
– An index lookup (returning a small result set)
– Parallel index lookups using the results of the first lookup
• Other joins …?
27. Query Joins - Merge Joins
Example using the global index PhotosByTag:
Query-1: SELECT * FROM Photo WHERE tag=girl
Query-2: SELECT * FROM Photo WHERE tag=car
Intersect (&) or union (|) the returned key sets: girl & car, or girl | car.
Just like:
SELECT * FROM Photo WHERE tag=girl AND tag=car
or
SELECT * FROM Photo WHERE tag=girl OR tag=car
Strictly, a merge join is not a real join in the lingo of SQL, but it really is a "join".
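The AND case above can be sketched as a single-pass intersection. This is a minimal illustration (function names are mine, not an API): because both index scans return primary keys in the same sorted order, the intersection needs one linear pass and no join table.

```python
def merge_and(left, right):
    # Single-pass intersection of two sorted key lists.
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i]); i += 1; j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

# Keys returned by the PhotosByTag scans:
girl = ["U1/P1", "U1/P2"]            # tag=girl
car = ["U1/P1"]                      # tag=car
both = merge_and(girl, car)          # tag=girl AND tag=car
either = sorted(set(girl) | set(car))  # tag=girl OR tag=car
```

The sorted-order guarantee from the index is what makes the AND case cheap; without it each key would need a hash lookup.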
28. Query Joins - Outer Joins
Query-1 (index lookup; suppose there is an index UsersByName):
SELECT name, userId FROM User WHERE name=Jack
-> returns name=Jack, userId=U1,U2
Query-2 (parallel index lookups, one per userId from Query-1):
SELECT thumbUrl FROM Photo WHERE time>T1 AND time<T10;
Just like:
SELECT User.name, User.userId, Photo.thumbUrl FROM User
LEFT OUTER JOIN Photo ON Photo.userId=User.userId
WHERE User.name=Jack AND Photo.time>T1 AND Photo.time<T10
Example result:
Jack, U1, TURL1
Jack, U2, NULL
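The application-side outer join above can be sketched as follows. The data and helper names are illustrative assumptions: first the UsersByName lookup, then parallel per-user index lookups, padding with None where a user has no matching photos (the LEFT OUTER JOIN semantics).

```python
from concurrent.futures import ThreadPoolExecutor

users = [("Jack", "U1"), ("Jack", "U2")]       # result of Query-1
photos_by_user = {"U1": [("T5", "TURL1")]}     # index rows with T1<time<T10

def lookup(user_id):
    # Stand-in for one PhotosByTime index lookup.
    return photos_by_user.get(user_id, [])

with ThreadPoolExecutor() as pool:             # parallel index lookups
    photo_lists = list(pool.map(lookup, [uid for _, uid in users]))

rows = []
for (name, uid), photos in zip(users, photo_lists):
    if photos:
        rows.extend((name, uid, url) for _, url in photos)
    else:
        rows.append((name, uid, None))         # outer-join NULL padding
```

The None row for U2 is what distinguishes this from an inner join: users without matching photos still appear in the result.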
29. Transactions and Concurrency Control
• An EG is a mini-database with serializable ACID transactions.
• Transactions within an EG
– A transaction writes its mutations into the EG's WAL; the mutations are then applied to the data.
– Readers use the timestamp of the last fully applied transaction to avoid seeing partial updates.
• MVCC: Multi-Version Concurrency Control (very important)
– Uses Bigtable cells' timestamps/versions
– Readers and writers don't block each other, and reads are isolated from writes for the duration of a transaction. (How? See MVCC on Wikipedia.)
• Write pattern
– A write transaction always begins with a current read to determine the next available log position. (This current read only ensures that all previously committed writes are applied.)
– The commit operation gathers mutations into a log entry, assigns it a timestamp higher than any previous one, and appends it to the log (using Paxos to replicate across DCs).
– The write operation can return to the client at any point after commit.
• Read patterns
– Current read
– Snapshot read
– Inconsistent read
Figure: a write op commits to the metadata and WAL of the EG root in Bigtable; a read op checks for committed logs and recovers them; the apply step (into table and index data in Bigtable) may be asynchronous.
30. Transactions and Concurrency Control - Write
Figure: the state of the transaction system for an EG (Bigtable timestamps serve as the MVCC versions). The metadata tracks the last committed position (ts2) and the last fully applied position (ts1). In the WAL:
• Transaction-1 (committed and applied): a log entry with Mutation-11-ts1 and Mutation-12-ts1; the data holds Data-part1-ts1 and Data-part2-ts1.
• Transaction-2 (committed but not fully applied): a log entry with Mutation-21-ts2 and Mutation-22-ts2; only Data-part1-ts2 has been applied so far.
• Transaction-3 (ongoing): still writing, not yet committed; its mutations have not been gathered and appended into the log.
When a failure occurs at each stage:
• Transaction-1: very safe.
• Transaction-2: safe, no data loss, but the partially applied data must be recovered from the log during a later "current read" or "write" operation.
• Transaction-3: not complete, failed; the application gets a failure return.
Note: the commit operation gathers mutations into a log entry and assigns a timestamp to it. A write transaction always begins by ensuring that all previously committed writes are applied (via a current read)!
31. Transaction Read Patterns and Lifecycle
• Current read
– Only within an EG
– When starting a current read, the transaction system first ensures that all previously committed writes are applied (just like recovery from commit logs).
– Then the application reads at the timestamp of the latest committed transaction.
• Snapshot read
– Only within an EG
– Picks up the timestamp of the last known fully applied transaction and reads from there.
– Some committed transactions may not yet be applied.
• Inconsistent read
– Reads the latest values directly; may see partially applied data, in exchange for aggressively low latency.
• A complete transaction lifecycle
– Read
• Get the timestamp of the last committed transaction from the metadata.
– Application logic
• Read-modify-write.
– Commit
• Gather mutations into a log entry, assign it a higher timestamp.
• Replicate across DCs via Paxos.
• Can return to the client here.
(the following steps may be asynchronous)
– Apply
• Write mutations into data and indexes.
– Clean up
• Delete fully applied log entries.
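The write lifecycle and current read can be modeled in a few lines. This is an illustrative model, not Megastore's code (the class is invented, and Paxos replication and cleanup are omitted): commit appends a timestamped log entry to the WAL and may return before apply, and a later current read rolls the log forward before reading.

```python
import itertools

class EntityGroup:
    def __init__(self):
        self.wal = []                 # [(ts, {key: value})]
        self.data = {}                # key -> {ts: value} (MVCC versions)
        self.last_applied = 0
        self._clock = itertools.count(1)

    def commit(self, mutations):
        # Gather mutations into a log entry with a timestamp higher than
        # any previous one; may return to the client here (apply deferred).
        ts = next(self._clock)
        self.wal.append((ts, dict(mutations)))
        return ts

    def _apply_pending(self):
        # Roll forward committed-but-unapplied log entries, in order.
        for ts, mutations in self.wal:
            if ts > self.last_applied:
                for k, v in mutations.items():
                    self.data.setdefault(k, {})[ts] = v
                self.last_applied = ts

    def current_read(self, key):
        self._apply_pending()         # the recovery step of a current read
        versions = self.data.get(key, {})
        return versions.get(max(versions, default=None))

eg = EntityGroup()
eg.commit({"name": "Jack"})
eg.commit({"name": "Jill"})           # committed, not yet applied
val = eg.current_read("name")         # applies both entries, reads latest ts
```

Keeping every version keyed by timestamp is what lets snapshot readers keep reading at an older timestamp while new commits land.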
32. Transaction Read Patterns – Current Read
Figure: a current read against the EG transaction state of slide 30 (last committed position ts2, last fully applied position ts1).
(1) Check the latest committed writes in the metadata and WAL.
(2) Apply the previously committed writes (Transaction-2's mutations are applied, producing Data-part1-ts2 and Data-part2-ts2).
(3) Update the metadata: last fully applied position (ts1) -> (ts2).
(4) Read the data at ts2.
Do the recovery write before reading data.
33. Transaction Read Patterns – Snapshot Read
Figure: a snapshot read against the same EG transaction state.
(1) Get the last fully applied timestamp (ts1) from the metadata.
(2) Read the data at ts1; Transaction-2, committed at ts2 but not fully applied, is not visible.
The very easy read pattern.
34. Transaction Read Patterns – Inconsistent Read
Figure: an inconsistent read against the same EG transaction state.
(1) Directly read the latest data, including partially applied data (e.g. Data-part1-ts2 alongside Data-part2-ts1).
The application must tolerate stale or partially applied data.
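The three read patterns of slides 32-34 can be contrasted on a tiny versioned store. This is an assumed model for illustration (plain dicts standing in for Bigtable cells): part1 has its ts2 version applied, part2 does not, mirroring the partially applied Transaction-2.

```python
data = {"part1": {1: "v1", 2: "v2"},   # ts2 already applied for part1
        "part2": {1: "v1"}}            # ts2 committed but not applied yet
last_applied = 1                        # metadata: last fully applied position

def snapshot_read(key):
    # Read at the last fully applied timestamp: always a consistent view.
    versions = data[key]
    ts = max(t for t in versions if t <= last_applied)
    return versions[ts]

def inconsistent_read(key):
    # Take the newest version present: may mix ts1 and ts2 across keys.
    versions = data[key]
    return versions[max(versions)]

snap = (snapshot_read("part1"), snapshot_read("part2"))
dirty = (inconsistent_read("part1"), inconsistent_read("part2"))
```

A current read would first apply the pending ts2 mutation to part2 and bump last_applied to 2, after which both patterns would agree.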
36. Replication
for High Availability …
(I need to study Paxos in more detail, so this part is not covered in depth.)
37. Replication
• Within-DC
– Across hosts
– Built-in from Bigtable and GFS
• Across-DC
– … synchronous and consistent for each write
38. Replication Across DCs
• Traditional strategies (do not work)
– Asynchronous master/slave
• Asynchronously propagates writes
• Master supports fast ACID transactions, low latency
• Risk of data loss
• Downtime during failover
• Heavyweight master
• Requires mediated mastership (e.g. ZooKeeper)
– Synchronous master/slave
• No data loss
• Downtime during failover
• Long latency
• Heavyweight master
• Requires mediated mastership (e.g. ZooKeeper)
– Optimistic replication
• No distinguished master
• Asynchronously propagates writes
• Availability and latency are excellent
• But there is no mutation order, and transactions are impossible
• Like Cassandra/Dynamo
• Megastore: EG-based synchronous replication of each write
– Uses Paxos
• No distinguished master
• Replicates the write-ahead log
• Synchronously replicates writes (each log append blocks on acknowledgments from a majority of replicas; replicas in the minority catch up as they are able)
• Any node can initiate writes and reads
• Reasonable latency
– Extensions
• Allow local reads at any up-to-date replica
• Permit single-roundtrip writes
39. Paxos
• Traditional usages
– Locking
– Master election
– Replication of metadata and configurations
• Megastore uses Paxos to
– Replicate primary user data across DCs on every write
– Achieve across-DC high availability
41. Valuable References
• P. Helland. Life beyond distributed transactions: an apostate's opinion. In CIDR, pages 132-141, 2007.
– A philosophical inspiration for Megastore