In the big data world, our data stores communicate over asynchronous, unreliable networks while presenting a facade of consistency. To really understand the guarantees these systems offer, we must understand the realities of networks and test our data stores against them.
Jepsen is a tool that simulates network partitions in data stores and helps us understand the guarantees of our systems and their failure modes. In this talk, I will explain why you should care about network partitions and how we can test data stores against partitions using Jepsen. I will describe what Jepsen is, how it works, and the kinds of tests it lets you create. We will explore the subtleties of distributed consensus and the CAP theorem, and see how different data stores such as MongoDB, Cassandra, Elastic and Solr behave under network partitions. Finally, I will present the results of the tests I wrote using Jepsen for Apache Solr and discuss the kinds of rare failures uncovered by this excellent tool.
Call me maybe: Jepsen and flaky networks
2. Call me maybe: Jepsen and flaky networks
Shalin Shekhar Mangar
@shalinmangar
Lucidworks Inc.
3. Typical first year for a new cluster — Jeff Dean, Google (LADIS 2009)
• ~5 racks out of 30 go wonky (50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~3 router failures (have to immediately pull traffic for an hour)
4. Reliable networks are a myth
• GC pauses
• Process crashes
• Scheduling delays
• Network maintenance
• Faulty equipment
8. CAP recap
• Consistency (Linearizability): a total order over all operations such that each operation appears to take effect at a single instant.
• Availability: every request received by a non-failing node in the system must result in a response.
• Partition tolerance: arbitrarily many messages between two nodes may be lost. Mandatory unless you can guarantee that partitions never happen.
10. Jepsen: Testing systems under stress
• Network partitions
• Random process crashes
• Slow networks
• Clock skew
https://github.com/aphyr/jepsen
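Jepsen's partition nemesis conceptually severs the links between two halves of the cluster, typically with iptables DROP rules executed on each node. A minimal sketch of that idea follows; the function names are invented for illustration (Jepsen itself is written in Clojure and runs these commands over SSH):

```python
# Sketch of a partition nemesis: compute the iptables rules that would cut
# the network between two halves of a cluster. In a real test each command
# would be executed on its node (e.g. over SSH) and later flushed to heal.
def partition_commands(half_a, half_b):
    """Return (node, command) pairs that make half_a and half_b drop each
    other's packets, producing a symmetric network partition."""
    cmds = []
    for a in half_a:
        for b in half_b:
            cmds.append((a, f"iptables -A INPUT -s {b} -j DROP"))
            cmds.append((b, f"iptables -A INPUT -s {a} -j DROP"))
    return cmds

def heal_commands(nodes):
    """Flush the INPUT chain on every node, restoring full connectivity."""
    return [(n, "iptables -F INPUT") for n in nodes]

# A 5-node cluster split 2/3: every cross-partition pair gets two DROP rules.
cmds = partition_commands(["n1", "n2"], ["n3", "n4", "n5"])
```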
11. Anatomy of a Jepsen test
• Automated DB setup (data store specific: Mongo/Solr/Elastic)
• Test definitions a.k.a. Client (data store specific)
• Partition types a.k.a. Nemesis (provided by Jepsen)
• Scheduler of operations, client & nemesis (provided by Jepsen)
• History of operations (provided by Jepsen)
• Consistency checker (provided by Jepsen)
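To make these moving parts concrete, here is an illustrative sketch of how a scheduler interleaves client operations with nemesis actions and records a history for the checker. All names are invented; this is not Jepsen's actual API (Jepsen is written in Clojure):

```python
# Illustrative only: a scheduler randomly interleaves client operations with
# nemesis actions, recording every (time, op, result) tuple as the history
# that a consistency checker examines afterwards.
import random

def run_test(client_ops, nemesis_ops, steps, seed=42):
    rng = random.Random(seed)
    history = []
    for t in range(steps):
        name, op = rng.choice(client_ops + nemesis_ops)
        result = op()              # invoke against the data store / network
        history.append((t, name, result))
    return history

# Stub operations standing in for real client and nemesis actions.
client_ops = [("add", lambda: "ok")]
nemesis_ops = [("partition", lambda: "network cut")]

history = run_test(client_ops, nemesis_ops, steps=10)
```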
16. A set of integers: cas-set-client
• S = {1, 2, 3, 4, 5, …}
• Stored as a single document containing all the integers
• Update using compare-and-set
• Multiple clients try to update concurrently
• Create and restore partitions
• Finally, read the set of integers and verify consistency
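The recipe above can be mocked up in-process. This is an illustrative sketch, not the real Jepsen client: the Store class stands in for the data store, and each worker retries its compare-and-set until its integer lands in the shared set:

```python
# In-process sketch of the cas-set-client idea: many workers each add an
# integer to one shared "document" via compare-and-set, retrying on conflict.
import threading

class Store:
    """A single document (a set of integers) guarded by a version number."""
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.value = frozenset()

    def read(self):
        with self._lock:
            return self.version, self.value

    def cas(self, expected_version, new_value):
        """Install new_value only if the version is unchanged."""
        with self._lock:
            if self.version != expected_version:
                return False       # conflict: another client won the race
            self.version += 1
            self.value = new_value
            return True

def add_with_retry(store, n):
    """Keep retrying the CAS until integer n lands in the set."""
    while True:
        version, value = store.read()
        if store.cas(version, value | {n}):
            return

store = Store()
threads = [threading.Thread(target=add_with_retry, args=(store, i))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final = store.read()[1]
```

A real test replaces Store with a live cluster and lets a nemesis cut the network while the workers run; the interesting question is whether `final` still contains every acknowledged integer.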
17. Compare and Set client
Timeline of two clients (Client 1 and Client 2) racing on one document:
t=0: cas({}, 1) → {1}
t=1: cas(1, 2) → {1, 2}; a concurrent cas(1, 3) fails (X)
t=x: cas(2, 4) fails (X); cas(2, 5) → {1, 2, 5}
18. Compare and Set client
Same timeline as above, now with every operation recorded:
History = [(t, op, result)]
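Given such a history, the final check is conceptually simple. A toy sketch (the exact tuple encoding here is an assumption for illustration): acknowledged writes must appear in the final read, timed-out writes are indeterminate, and anything else in the final read is a phantom:

```python
# Toy final-read checker over a history of (time, op, result) entries.
# "ok" writes must survive; "timeout" writes may or may not have applied;
# anything in the final read that was never acknowledged is a phantom.
def check(history, final_read):
    acked, maybe = set(), set()
    for t, (op, n), result in history:
        if op == "add" and result == "ok":
            acked.add(n)
        elif op == "add" and result == "timeout":
            maybe.add(n)
    lost = acked - final_read               # acknowledged but missing: data loss
    phantom = final_read - acked - maybe    # present but never acknowledged
    return lost, phantom

history = [
    (0, ("add", 1), "ok"),
    (1, ("add", 2), "ok"),
    (2, ("add", 3), "timeout"),   # may or may not have been applied
]
```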
19. Solr
• Search server built on Lucene
• Lucene index + transaction log
• Optimistic concurrency, linearizable CAS ops
• Synchronous replication to all ‘live’ nodes
• ZooKeeper for ‘consensus’
• http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
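Solr's optimistic concurrency is driven by the `_version_` field: an update carrying a `_version_` greater than 1 is applied only if the stored version matches exactly, otherwise Solr answers HTTP 409 (conflict). A small standard-library sketch of that CAS pattern; the URL and collection name are placeholders:

```python
# Sketch of a compare-and-set update against Solr using the _version_ field.
# With _version_ > 1 Solr applies the update only on an exact version match
# and returns HTTP 409 on conflict, so the caller can re-read and retry.
import json
import urllib.error
import urllib.request

SOLR = "http://localhost:8983/solr/jepsen"   # placeholder URL and collection

def cas_payload(doc_id, fields, expected_version):
    """Build the update body; _version_ carries the expected version."""
    return [{"id": doc_id, "_version_": expected_version, **fields}]

def send_update(payload):
    """POST the update; returns None on a 409 version conflict."""
    req = urllib.request.Request(
        f"{SOLR}/update?commit=true",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as e:
        if e.code == 409:
            return None        # conflict: caller should re-read and retry
        raise
```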
21. Solr - Are we safe?
• Leaders become unavailable for up to the ZK session timeout, typically 30 seconds (expected)
• Some writes ‘hang’ for a long time during a partition. Timeouts are essential. (unexpected)
• Final reads under CAS are consistent, though we haven’t proved linearizability (good!)
• Loss of availability for writes in the minority partition (expected)
• No data loss (yet!), which is great!
22. Solr - Bugs, bugs & bugs
• SOLR-6530: Commits under network partition can put any node into ‘down’ state
• SOLR-6583: Resuming connection with ZK causes log replay
• SOLR-6511: Request threads hang under network partition
• SOLR-7636: A flaky cluster status API that times out during partitions
• SOLR-7109: Indexing threads stuck under network partition can mark leader as down
23. Elastic
• Search server built on Lucene
• Lucene index + transaction log
• Consistent single-document reads, writes & updates
• Eventually consistent search, but a flush/commit should ensure that changes are visible
24. Elastic
• Optimistic concurrency control, a.k.a. linearizable CAS
• Synchronous acknowledgement from a majority of nodes
• “Instantaneous” promotion under a partition
• Homegrown ‘ZenDisco’ consensus
25. Elastic - Are we safe?
• “Instantaneous” promotion is not: up to 90-second timeouts to elect a new primary (worse before 1.5.0)
• Bridge partition: 645/1961 acknowledged writes lost in 1.1.0. Better in 1.5.0: only 22/897 lost.
• Isolated primaries: 209/947 updates lost
• Repeated pauses (simulating GC): 200/2143 updates lost
• Getting better but not quite there. Good documentation of the resiliency problems.
26. MongoDB
• Document-oriented database
• A replica set has a single primary which accepts writes
• The primary asynchronously replicates writes to secondaries
• Replicas decide among themselves when to promote/demote primaries
• Applies to 2.4.3 and 2.6.7
27. MongoDB
• Claims atomic writes per document and consistent reads
• But strict consistency only when reading from primaries
• Eventual consistency when reading from secondaries
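The consequence of asynchronous replication can be seen in a toy model (illustrative Python, not the MongoDB wire protocol): a write acknowledged by the primary is not visible on a secondary until the replication queue is applied, so reads routed to secondaries can be stale:

```python
# Toy model of asynchronous replication: the primary acknowledges a write
# before the secondary has applied it, so a read from the secondary is stale
# until replication catches up.
class Node:
    def __init__(self):
        self.data = {}

primary, secondary = Node(), Node()
oplog = []                        # replication queue, not yet applied

def write(key, value):
    primary.data[key] = value     # acknowledged immediately on the primary
    oplog.append((key, value))    # replication happens later

def replicate():
    while oplog:
        k, v = oplog.pop(0)
        secondary.data[k] = v

write("x", 1)
stale = secondary.data.get("x")   # replication has not run yet
replicate()
fresh = secondary.data.get("x")   # visible once the oplog is applied
```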
28. MongoDB - Are we safe?
Source: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
29. MongoDB - Are we really safe?
• Inconsistent reads are possible even with majority write concern
• Read-uncommitted isolation
• A minority partition will allow both stale reads and dirty reads
30. Conclusion
• Network communication is flaky! Plan for it.
• Hacker News driven development (HDD) is not a good way to choose data stores!
• Test the guarantees of your data stores.
• Help me find more Solr bugs!
31. References
• Kyle Kingsbury’s posts on Jepsen: https://aphyr.com/tags/jepsen
• Solr & Jepsen: http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
• Jepsen on GitHub: https://github.com/aphyr/jepsen
• Solr fork of Jepsen: https://github.com/LucidWorks/jepsen
32. Solr/Lucene Meetup on 25th July 2015
Venue: Target Corporation, Manyata Embassy Business Park
Time: 9:30am to 1pm
Talks:
Crux of eCommerce Search and Relevancy
Creating Search Analytics Dashboards
Signup at http://meetu.ps/2KnJHM