1. Introduction (1)
• Interactive online services demand:
– High scalability
– Rapid development
– Low latency
– Consistency of data
– High availability
→ conflicting requirements
• Solution: Megastore
– Scalability of NoSQL
→ partition + replicate
– Convenience of an RDBMS
→ ACID semantics within a partition
– High availability
1. Introduction (2)
Widely deployed at Google for several years
– >100 production applications
– 3 billion writes and 20 billion reads daily
– A petabyte of data across multiple datacenters
– Available on GAE since Jan 2011
2.1 Availability and scalability
• Availability: Paxos
→ fault-tolerant consensus algorithm
– No master
– Replicated logs
• Scalability:
– Partition data into small databases
– Each partition has its own replicated log (sketched below)
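A minimal Python sketch of this partitioning idea; all names (Store, EntityGroup, WriteAheadLog) are invented for illustration, not Megastore's API. The point is that each partition owns its own write-ahead log, so consensus never has to span partitions.

```python
class WriteAheadLog:
    def __init__(self):
        self.entries = []  # in Megastore, replicated via Paxos per partition

    def append(self, mutation):
        # The real append is a consensus round across replicas;
        # here we only record it locally.
        self.entries.append(mutation)

class EntityGroup:
    """One small database partition with its own replicated log."""
    def __init__(self, root_key):
        self.root_key = root_key
        self.log = WriteAheadLog()

class Store:
    def __init__(self):
        self.groups = {}

    def group_for(self, root_key):
        # Each root entity defines its own partition, so consensus
        # traffic stays inside one entity group.
        return self.groups.setdefault(root_key, EntityGroup(root_key))

store = Store()
store.group_for("user:42").log.append(("put", "photo:1", "img.jpg"))
print(len(store.group_for("user:42").log.entries))  # 1
```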
3.1 Megastore features: API
• Megastore = cost-transparent API
– No expressive query language
– Storing and querying hierarchical data in a key-value store is easy
– Joins in application logic (merge join sketched below):
• Merge phase supported
• Outer joins based on index lookups
→ understandable performance implications
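The merge phase the API exposes is essentially a merge join run by the application. A hedged Python sketch over two key-sorted streams (the data and the function name are made up); the cost is one linear pass over each input, which is what makes the performance implications easy to reason about.

```python
def merge_join(left, right):
    """Join two iterables of (key, value) pairs, both sorted by key."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            # Matching keys: emit the joined row, advance both sides.
            yield l[0], l[1], r[1]
            l, r = next(left, None), next(right, None)
        elif l[0] < r[0]:
            l = next(left, None)
        else:
            r = next(right, None)

users  = [(1, "alice"), (2, "bob"), (4, "dave")]
photos = [(1, "p100"), (2, "p200"), (3, "p300")]
print(list(merge_join(users, photos)))  # [(1, 'alice', 'p100'), (2, 'bob', 'p200')]
```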
3.2 Megastore features: Data model
• Megastore tables:
– Entity group root table
– Child tables: reference the root
• Entity: a single row
→ identified by the concatenation of its keys (sketch below)
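A toy sketch of how concatenated keys identify entities, assuming a hypothetical string encoding (the real encoding differs): a child row's key begins with its root entity's key.

```python
def entity_key(*components):
    # Hypothetical encoding: join key parts in order. A child row's
    # primary key starts with the root's key, so the concatenation
    # alone identifies the entity within its group.
    return "/".join(str(c) for c in components)

root  = entity_key("User", 42)              # root entity
photo = entity_key("User", 42, "Photo", 7)  # child row under the same root
print(root, photo)  # User/42 User/42/Photo/7
```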
3.2 Megastore features: Indexes
• Two levels of indexes:
– Local: one per entity group
• Updated atomically and consistently
– Global: spans entity groups
• Finds entities without knowing their keys
• Not all updates immediately visible (toy example below)
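A toy Python example of the visibility difference (all structures invented): the local index is changed in the same step as the entity, while the global index update is queued and applied later, so global-index readers can miss recent writes.

```python
class Group:
    def __init__(self):
        self.rows = {}
        self.local_index = {}   # updated atomically with the row

global_index = {}               # spans groups; refreshed asynchronously
pending = []                    # global-index updates not yet applied

def write(group, key, name):
    # Local index changes commit together with the entity...
    group.rows[key] = name
    group.local_index[name] = key
    # ...while the global index update is merely queued.
    pending.append((name, key))

def refresh_global_index():
    while pending:
        name, key = pending.pop(0)
        global_index[name] = key

g = Group()
write(g, "user:1", "alice")
print("alice" in g.local_index)  # True: consistent immediately
print("alice" in global_index)   # False until refresh_global_index() runs
```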
3.2 Megastore features: Bigtable
Primary keys cluster entities together
Each entity = a single Bigtable row
“IN TABLE” co-locates tables in a single Bigtable
→ key ordering ensures entities are stored adjacently
Bigtable column name = Megastore table name + property name
(mapping sketched below)
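A sketch of this mapping with invented encoding details: lexicographic row-key order keeps a root and its children adjacent, and each column is named "<table>.<property>".

```python
rows = {}  # stand-in for a Bigtable, keyed by row key

def put(table, key_parts, **properties):
    row_key = "/".join(str(p) for p in key_parts)
    # Column name = Megastore table name + property name.
    cols = {f"{table}.{name}": value for name, value in properties.items()}
    rows.setdefault(row_key, {}).update(cols)

put("User", ["User", 42], name="alice")
put("Photo", ["User", 42, "Photo", 7], url="img7.jpg")

# Lexicographic key order keeps an entity group's rows adjacent:
for row_key in sorted(rows):
    print(row_key, rows[row_key])
# User/42          {'User.name': 'alice'}
# User/42/Photo/7  {'Photo.url': 'img7.jpg'}
```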
3.3 Megastore features: Transactions (2)
• Three levels of read consistency (sketched below):
– Current: read the EG after all committed write logs are applied
– Snapshot: read the last fully applied transaction of the EG
– Inconsistent: ignore the log and read the latest values
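The three modes, condensed into one hedged sketch (the log and bookkeeping here are invented): current reads first roll the log forward, snapshot reads return the last fully applied transaction, and inconsistent reads go straight to storage, which may be mid-apply.

```python
class EntityGroup:
    def __init__(self):
        self.log = []        # committed transactions: lists of (key, value)
        self.cells = {}      # underlying storage, mutated during apply
        self.applied_upto = 0
        self.snapshot = {}   # state as of the last fully applied transaction

    def commit(self, txn_mutations):
        self.log.append(txn_mutations)

    def apply_next(self):
        # Apply one committed transaction, then advance the snapshot.
        if self.applied_upto < len(self.log):
            for key, value in self.log[self.applied_upto]:
                self.cells[key] = value
            self.applied_upto += 1
            self.snapshot = dict(self.cells)

    def read_current(self, key):
        # Current: first catch up on every committed-but-unapplied entry.
        while self.applied_upto < len(self.log):
            self.apply_next()
        return self.cells.get(key)

    def read_snapshot(self, key):
        # Snapshot: the state of the last *fully applied* transaction.
        return self.snapshot.get(key)

    def read_inconsistent(self, key):
        # Inconsistent: read storage directly, ignoring apply progress;
        # may observe values from a transaction still being applied.
        return self.cells.get(key)

eg = EntityGroup()
eg.commit([("x", 1)])
print(eg.read_snapshot("x"), eg.read_current("x"))  # None 1
```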
3.3 Megastore features: Transactions (3)
Write transaction (sketched below):
― Current read: obtain the timestamp and log position of the last committed transaction
― Application logic: read from Bigtable and gather writes into a log entry
― Commit: use Paxos to achieve consensus on appending the log entry to the log
― Apply: write the mutations to the entities and indexes in Bigtable
― Clean up: delete temporary data
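The five steps, traced in a hedged Python sketch; paxos_append() is a stand-in for the real consensus round, and the Group internals are invented bookkeeping.

```python
class ConflictError(Exception):
    pass

class Group:
    """Toy entity group; all internals are invented for illustration."""
    def __init__(self):
        self.log = []
        self.state = {}

    def last_committed_position(self):
        return len(self.log) - 1

    def read_current_state(self):
        return dict(self.state)

    def apply(self, entry):
        self.state.update(entry)

    def cleanup(self):
        pass  # the real system deletes temporary apply-time data here

def paxos_append(group, position, entry):
    # Placeholder: the real step runs a Paxos round across replicas and
    # fails if another writer claimed this log position first.
    if position != len(group.log):
        return False
    group.log.append(entry)
    return True

def write_transaction(group, app_logic):
    # 1. Current read: log position of the last committed transaction.
    position = group.last_committed_position()
    # 2. Application logic: read state, gather writes into one log entry.
    entry = app_logic(group.read_current_state())
    # 3. Commit: consensus on appending the entry to the log.
    if not paxos_append(group, position + 1, entry):
        raise ConflictError("lost the race for this log position; retry")
    # 4. Apply: write mutations to entities and indexes.
    group.apply(entry)
    # 5. Clean up: delete temporary data.
    group.cleanup()

g = Group()
write_transaction(g, lambda state: {"counter": state.get("counter", 0) + 1})
print(g.state)  # {'counter': 1}
```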
3.3 Megastore features: Transactions (4)
Queues: transactional messaging between EGs (sketched below)
― Transactions handle messages atomically
― Perform operations spanning many EGs
― One queue associated with each EG (scalable)
Two-phase commit across EGs is also supported
→ queues are preferred over two-phase commit
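A sketch of the queue pattern (all structures invented): the sender's transaction atomically appends to the receiver's queue, and the receiver consumes the message in its own later transaction, so no single transaction spans two EGs.

```python
from collections import deque

class Group:
    def __init__(self):
        self.state = {}
        self.inbox = deque()   # each EG owns its queue -> scales with EGs

def send_within_txn(sender, receiver, message):
    # In Megastore, enqueueing is part of the sender's atomic commit;
    # here both effects happen together to mimic that atomicity.
    sender.state["last_sent"] = message
    receiver.inbox.append(message)

def consume_within_txn(receiver):
    # The receiving EG handles the message atomically in its own txn.
    if receiver.inbox:
        receiver.state["last_received"] = receiver.inbox.popleft()

a, b = Group(), Group()
send_within_txn(a, b, "hello")
consume_within_txn(b)
print(b.state)  # {'last_received': 'hello'}
```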
4.2 Replication: Paxos adaptation
• Fast reads: local, via coordinators (sketch below)
– No inter-replica communication needed
– Coordinator tracks which EGs are up-to-date at the local replica
→ simple because it keeps no persistent database state
• Fast writes: via leaders
– Eliminates the prepare phase
– Multiple writers submit to the same leader
– Leader = replica closest to the writer
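A sketch of the fast-read decision, with invented names: the coordinator is just an in-memory set of up-to-date EGs, so checking whether a read can stay local costs one lookup; otherwise the read falls back to a majority of replicas and re-validates the local one.

```python
class Coordinator:
    """In-memory set of EGs that are up-to-date at the local replica.
    No persistent state, matching the 'simple process' claim."""
    def __init__(self):
        self.up_to_date = set()

def read(eg, key, coord, local_replica, majority_read):
    if eg in coord.up_to_date:
        # Fast read: served entirely locally, no inter-replica traffic.
        return local_replica[eg].get(key)
    # Slow path: consult a majority of replicas for the newest state,
    # catch the local replica up (elided here), then re-validate it.
    value = majority_read(eg, key)
    coord.up_to_date.add(eg)
    return value

coord = Coordinator()
coord.up_to_date.add("eg1")
print(read("eg1", "k", coord, {"eg1": {"k": "v"}}, lambda eg, k: None))  # v
```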
4.3 Replication: Algorithms (1)
• Each replica stores the log entries of an EG
→ can accept entries out of order (catch-up sketched below)
• Read:
→ at least one replica must be up-to-date
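The editor's notes at the end describe the catch-up step for such out-of-order logs: any position with no known-committed value gets a no-op Paxos round, which converges either on the value someone else committed or on the no-op. A sketch with invented structures:

```python
NOOP = object()  # sentinel: "nothing was committed at this position"

def catch_up(local_log, positions_to_fill, propose_noop):
    """Fill holes in a replica's log before reading.

    local_log: dict position -> committed entry (may have gaps, since
    replicas accept log entries out of order).
    propose_noop(position): stands in for a Paxos round proposing a no-op;
    returns whatever value that round converges on (an entry or the no-op).
    """
    for pos in positions_to_fill:
        if pos not in local_log:
            # Paxos guarantees convergence: either another writer's value
            # was already chosen, or our no-op gets chosen.
            local_log[pos] = propose_noop(pos)

log = {0: "a", 2: "c"}                     # position 1 is a hole
catch_up(log, range(3), lambda pos: NOOP)  # no one committed at 1
print(log[1] is NOOP)                      # True
```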
4.3 Replication: Algorithms (2)
• Prepare: package the changes + timestamp + next leader as a log entry
• If a replica did not accept the write: invalidate its coordinator (sketch below)
• Data only becomes visible after the invalidate step
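A sketch of the invalidate rule (structures invented): before the write is acknowledged, the coordinator of every full replica that missed the value is told the EG is stale there, which is why the data only becomes visible after this step.

```python
def finish_write(eg, accepted_by, coordinators):
    """accepted_by: dict replica name -> bool (did it accept the entry?).
    coordinators: dict replica name -> set of EGs considered up-to-date."""
    for name, accepted in accepted_by.items():
        if not accepted:
            # This replica missed the write: mark the EG stale at its
            # coordinator so it cannot serve a fast local read of old data.
            coordinators[name].discard(eg)
    # Only after every lagging coordinator is invalidated is the write
    # acknowledged, i.e. visible to current reads.
    return "acknowledged"

coords = {"us": {"eg1"}, "eu": {"eg1"}}
finish_write("eg1", {"us": True, "eu": False}, coords)
print(coords)  # {'us': {'eg1'}, 'eu': set()}
```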
4.4 Replication: Coordinator availability
• Coordinator: one in each datacenter
→ keeps state for the local replica only
→ simple process = more stable
• Failure detection:
– Chubby locks: detect whether a coordinator is still online
→ loses a majority of its locks: considers all EGs out-of-date
– Datacenter failure: writers wait for the coordinator's locks to expire before a write can complete
• Validation races (sketch below):
– Messages always carry the log position
– Higher position wins
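The race rule fits in a few lines (invented structure): every invalidate/validate message carries a log position, and a validate older than the newest invalidate is ignored, so a late-arriving message cannot mark a stale EG as up-to-date.

```python
class Coordinator:
    def __init__(self):
        self.invalid_upto = {}   # eg -> highest log position seen invalid
        self.valid_at = {}       # eg -> log position last validated

    def invalidate(self, eg, position):
        self.invalid_upto[eg] = max(self.invalid_upto.get(eg, -1), position)

    def validate(self, eg, position):
        # Higher log position wins: ignore validates that are older than
        # an invalidate already processed for this EG.
        if position >= self.invalid_upto.get(eg, -1):
            self.valid_at[eg] = position

c = Coordinator()
c.invalidate("eg1", 7)      # write at position 7 missed this replica
c.validate("eg1", 5)        # stale validate arrives late: ignored
print("eg1" in c.valid_at)  # False
```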
4.5 Replication: Replica types
Full replicas:
― what we have seen until now
Witness replicas:
– can vote
– store write-ahead logs but not the data
Read-only replicas:
– cannot vote
– hold a full snapshot of the data
5. Results
• Read latency: 10+ ms
• Write latency: 300 ms
• Issue: a replica becomes unavailable
• Solutions:
– Reroute traffic to nearby application servers
– Disable the replica's coordinator
6. Conclusion
• Scalability and availability
• Simpler reasoning and usage
→ ACID semantics
• Latency:
→ best effort (low enough for interactive apps)
• Throughput within an EG: a few writes per second
→ if not enough: shard EGs or place replicas near each other
Question 1
As stated several times, only two elements of CAP can be kept. Which two does
Megastore focus on, and how?
Reasoning:
– Partition tolerance: dividing the database into EGs and replicating these
over multiple datacenters
– Availability: providing a highly available service through Paxos
– Consistency: relaxed consistency between EGs and for global indexes
Question 2
Current reads have the following guarantees:
– A read always observes the last-acknowledged write.
– After a write has been observed, all future reads observe that write. (A write
might be observed before it is acknowledged.)
Contradiction?
→ No, I do not think so, but it is confusing
Reasoning:
– The two guarantees concern current reads (these are the reads that
preserve consistency)
– The sentence in parentheses notes that inconsistent reads are also
possible, but they are not current reads
Question 3
In my opinion, a lot of their focus goes towards making the system consistent. But
in their API they also give you the possibility to request current, snapshot, and
inconsistent data. Do you think this is a valuable addition?
Reasoning:
→ Mainly due to the performance bottleneck of a consistent system
→ Depends on the application: there exist applications where you do not mind
reading something inconsistent
→ Current and snapshot reads still maintain consistency
=> Personally I think the value is limited: I cannot think of an application where
latency is so critical that it would rather get inconsistent data than wait a bit longer
Question 4
4.4.3: For me it is not clear how the read-only replicas receive their data, as they
need consistent data. Do you have an idea?
→ not mentioned in the paper
Idea:
– The coordinator of a replica keeps track of the up-to-date EGs
– A mechanism periodically takes a snapshot of these up-to-date EGs
and copies it to the read-only replicas
Question 5
Megastore is compared with Bigtable multiple times; could you give what,
according to you, are the biggest differences in implementation and in types of
usage?
Reasoning:
– Built on top of Bigtable and based on different requirements:
→ consistency guarantees + wide-area communication
– Bigtable is used within one datacenter ↔ Megastore across multiple
→ increased availability (Paxos) but higher latency
– Consistency guarantees of Megastore: I suspect lower performance and
throughput than with Bigtable
– Implementation:
• Bigtable: a master ensures replication ↔ Paxos (no master recovery)
• Bigtable: one log per tablet server ↔ one log per EG per replica
• Very different APIs: Megastore supports schemas + indexes
– Note: Google App Engine moved from Bigtable to Megastore
References
MacDonald A., Paxos by Example,
http://angusmacdonald.me/writing/paxos-by-example/, accessed 06-05-14
Google App Engine, switch from Bigtable to Megastore,
http://googleappengine.blogspot.be/2009/09/migration-to-better-datastore.html,
accessed 08-05-13
Editor's notes
Catch-up: if there is no known-committed value for a log position
→ initiate a no-op Paxos round
→ Paxos will converge to the accepted value or to the no-op
Coordinators: out-of-band protocol to check whether they are offline
Loses a majority of its locks: subsequent reads each ensure that the replica has a new lock
→ once it has a majority again, it can handle requests