5. Distributed Hash Tables (DHT)
Source: Wikipedia - http://commons.wikimedia.org/wiki/File:DHT_en.svg
6. • Decentralized Hash Table functionality
• Interface
• put(K,V)
• get(K) -> V
• Nodes can fail, join and leave
• The system has to scale
Distributed Hash Tables (DHT)
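In Java, the interface on this slide amounts to little more than a map contract; a minimal sketch (the type and method names are illustrative, not from any particular library):

// Minimal DHT interface sketch; names are illustrative.
public interface DistributedHashTable<K, V> {
    void put(K key, V value);  // store a value under a key somewhere in the system
    V get(K key);              // locate and return the value stored under a key
}

The hard part is everything the signature hides: which node holds each key, and how that mapping survives nodes failing, joining, and leaving.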
7. • Flooding across N nodes
• put() – store on any node: O(1)
• get() – send the query to all nodes: O(N)
• Full replication across N nodes
• put() – store on all nodes: O(N)
• get() – check any node: O(1)
Simple solutions
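A sketch of both naive strategies against the interface above, assuming a hypothetical Node type with local put/get operations; the point is the asymmetry of the costs:

// Both naive strategies; Node is a stand-in for a remote peer.
import java.util.List;

interface Node<K, V> {
    void putLocal(K key, V value);
    V getLocal(K key);  // null if this node does not hold the key
}

class FloodingDHT<K, V> implements DistributedHashTable<K, V> {
    private final List<Node<K, V>> nodes;
    FloodingDHT(List<Node<K, V>> nodes) { this.nodes = nodes; }

    public void put(K key, V value) {
        nodes.get(0).putLocal(key, value);          // any one node: O(1)
    }
    public V get(K key) {
        for (Node<K, V> n : nodes) {                // query every node: O(N)
            V v = n.getLocal(key);
            if (v != null) return v;
        }
        return null;
    }
}

class ReplicatingDHT<K, V> implements DistributedHashTable<K, V> {
    private final List<Node<K, V>> nodes;
    ReplicatingDHT(List<Node<K, V>> nodes) { this.nodes = nodes; }

    public void put(K key, V value) {
        for (Node<K, V> n : nodes) n.putLocal(key, value);  // all nodes: O(N)
    }
    public V get(K key) {
        return nodes.get(0).getLocal(key);          // any one node: O(1)
    }
}

Either way one of the two operations costs O(N), which is exactly what structured DHTs such as Chord avoid.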
15. Chord: Peer-to-peer Lookup Protocol
• Load Balance – distributed hash function, spreading
keys evenly over nodes
• Decentralization – fully distributed, no single point of failure
• Scalability – lookup cost grows logarithmically with the number of nodes, so large systems are feasible
• Availability – automatically adjusts its internal tables
to ensure the node responsible for a key is always
found
• Flexible naming – key-space is flat (flexibility in how
to map names to keys)
16. Chord – Lookup O(N)
Source: Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, Hari Balakrishnan
17. Chord – Lookup O(logN)
Source: Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, Hari Balakrishnan
• K = 6, identifier space (0, 2^6 − 1)
• finger[i] = first node that succeeds (N + 2^(i−1)) mod 2^K, where 1 ≤ i ≤ K
• Successor/Predecessor – the next/previous node on circle
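A compact sketch of the finger-table construction for this K = 6 example, with the live ring modeled as a sorted set of node IDs (the real protocol fills these entries via remote lookups):

// Finger-table sketch for a K-bit Chord ring; the ring is modeled locally.
import java.util.TreeSet;

class ChordFingers {
    static final int K = 6;             // identifier bits
    static final int SPACE = 1 << K;    // 2^K = 64 positions on the circle

    // First live node at or after id, wrapping around the circle.
    static int successor(TreeSet<Integer> ring, int id) {
        Integer s = ring.ceiling(id);
        return (s != null) ? s : ring.first();
    }

    // finger[i] = first node that succeeds (n + 2^(i-1)) mod 2^K, 1 <= i <= K
    static int[] fingers(TreeSet<Integer> ring, int n) {
        int[] finger = new int[K + 1];  // index 0 unused, matching the paper
        for (int i = 1; i <= K; i++) {
            int start = (n + (1 << (i - 1))) % SPACE;
            finger[i] = successor(ring, start);
        }
        return finger;
    }
}

Because finger[i] jumps roughly 2^(i−1) positions ahead, each hop at least halves the remaining distance to the target key, which is where the O(log N) lookup comes from.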
18. Chord – Node Join
Source: Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, Hari Balakrishnan
• Node 26 joins the system between nodes 21 and 32.
• (a) Initial state: node 21 points to node 32;
• (b) node 26 finds its successor (i.e., node 32) and points to it;
• (c) node 26 copies all keys less than 26 from node 32;
• (d) the stabilize procedure updates the successor of node 21
to node 26.
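The same join, as a Java-flavored sketch of the paper's join/stabilize/notify pseudocode (remote calls and key transfer are elided; inOpenInterval tests membership in a circular interval):

// Sketch of Chord's join and stabilization; ChordNode stands in for a remote peer.
class ChordNode {
    int id;
    ChordNode successor, predecessor;

    // (a)-(b): the joining node asks any known node for its successor.
    void join(ChordNode known) {
        predecessor = null;
        successor = known.findSuccessor(id);    // node 26 finds node 32
        // (c): copy the keys it is now responsible for from the successor (elided)
    }

    // (d): runs periodically on every node; repairs successor/predecessor pointers.
    void stabilize() {
        ChordNode x = successor.predecessor;
        if (x != null && inOpenInterval(x.id, id, successor.id))
            successor = x;                      // node 21 adopts node 26 as successor
        successor.notifyOf(this);
    }

    void notifyOf(ChordNode candidate) {
        if (predecessor == null || inOpenInterval(candidate.id, predecessor.id, id))
            predecessor = candidate;
    }

    // True if x lies strictly between a and b going clockwise on the circle.
    static boolean inOpenInterval(int x, int a, int b) {
        return (a < b) ? (x > a && x < b) : (x > a || x < b);
    }

    ChordNode findSuccessor(int key) { return this; /* finger-table lookup, elided */ }
}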
19. • CAN (Hypercube), Chord (Ring), Pastry (Tree+Ring),
Tapestry (Tree+Ring), Viceroy, Kademlia, Skipnet,
Symphony (Ring), Koorde, Apocrypha, Land,
Bamboo, ORDI …
The world of DHTs …
21. Where do we store data?
One size does not fit all...
23. Infinispan – History
• 2002 – JBoss App Server needed a clustered solution for
HTTP and EJB session state replication in HA clusters.
JGroups (an open source group communication suite) had a
replicated map demo, which was expanded into a tree data
structure, with eviction and JTA transactions added.
• 2003 – this was moved to JBoss AS code base
• 2005 – JBoss Cache was extracted and became a standalone
project
… JBoss Cache evolved into Infinispan, with core parts redesigned
• 2009 – JBoss Cache 3.2 and Infinispan 4.0.0.ALPHA1 were
released
• 2015 - 7.2.0.Alpha1
• Check the Infinispan RoadMap for more details
28. Infinispan Clustering and Consistent Hashing
• JGroups Views
• Each node has a unique address
• View changes when nodes join, leave
• Keys are hashed using MurmurHash3
algorithm
• Hash Space is divided into segments
• Key → Segment → Owners
• Primary and Backup Owners
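An illustrative sketch of the key → segment → owners chain (this is not Infinispan's actual implementation; Object.hashCode stands in for MurmurHash3 and the owner table is given up front):

// Illustrative key -> segment -> owners mapping.
import java.util.List;

class SegmentMapper {
    private final int numSegments;             // the hash space is split into segments
    private final List<List<String>> owners;   // owners.get(s) = [primary, backups...]

    SegmentMapper(int numSegments, List<List<String>> owners) {
        this.numSegments = numSegments;
        this.owners = owners;
    }

    int segmentOf(Object key) {
        int h = key.hashCode();                // Infinispan uses MurmurHash3 here
        return Math.floorMod(h, numSegments);
    }

    String primaryOwner(Object key) {
        return owners.get(segmentOf(key)).get(0);
    }

    List<String> backupOwners(Object key) {
        List<String> o = owners.get(segmentOf(key));
        return o.subList(1, o.size());
    }
}

Because ownership is assigned per segment rather than per key, a view change only remaps whole segments between nodes.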
29. Does it scale?
• 320 nodes, 3000 caches, 20 TB RAM
• Largest cluster formed: 1000 nodes
38. If multiple nodes fail…
• CAP Theorem to the rescue:
• Formulated by Eric Brewer in 1998
• C - Consistency
• A - High Availability
• P - Tolerance to Network Partitions
• Only 2 can be satisfied at the same time:
• Consistency + Availability: The Ideal World where
network partitions do not exist
• Partitioning + Availability: Data might be different
between partitions
• Partitioning + Consistency: Do not corrupt data!
39. Infinispan Partition Handling Strategies
• In the presence of network partitions
• Prefer availability (partition handling DISABLED)
• Prefer consistency (partition handling ENABLED)
• Split Detection with partition handling ENABLED:
• Ensure stable topology
• Nodes lost > numOwners OR no simple majority
• Check segment ownership
• Mark partition as Available / Degraded
• Send PartitionStatusChangedEvent to listeners
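Programmatically, preferring consistency is a single switch on the cache configuration; a sketch against the 7.x API shown in these slides (the cache name is illustrative; verify the builder methods against your version):

// Enable partition handling, i.e. prefer consistency over availability.
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class PartitionHandlingConfig {
    public static void main(String[] args) {
        ConfigurationBuilder cache = new ConfigurationBuilder();
        cache.clustering()
             .cacheMode(CacheMode.DIST_SYNC)         // distributed, synchronous
             .partitionHandling().enabled(true);     // minority partitions go Degraded

        DefaultCacheManager manager = new DefaultCacheManager(
                GlobalConfigurationBuilder.defaultClusteredBuilder().build());
        manager.defineConfiguration("consistent-cache", cache.build());
    }
}

With the flag left at its default (disabled), every partition stays Available and may accept conflicting writes.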
42. Merging Split Clusters
• Split Clusters see each other again
• Step 1: Ensure stable topology
• Step 2: Automatic, based on partition state
• 1 Available -> attempt merge
• All Degraded -> attempt merge
• Step 3: Manual
• Data was lost
• Custom listener on Merge
• Application decides
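A sketch of the custom-listener hook from Step 3, using Infinispan's cache-manager notification annotations (the reconciliation logic itself is the application's problem and only hinted at here):

// React to a cluster merge; register with cacheManager.addListener(new MergeHandler()).
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.Merged;
import org.infinispan.notifications.cachemanagerlistener.event.MergeEvent;

@Listener
public class MergeHandler {
    @Merged
    public void onMerge(MergeEvent event) {
        // The subgroups that just rejoined into one view; if both sides stayed
        // Available, their data may have diverged and must be reconciled here.
        System.out.println("Merged subgroups: " + event.getSubgroupsMerged());
    }
}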
43. Querying Infinispan
• Apache Lucene Index
• Native Query API (Query DSL)
• Hibernate Search and Apache Lucene to index and
search
• Native Map/Reduce
• Index-less
• Distributed Execution Framework
• Hadoop Integration (WIP)
• Run existing map/reduce jobs on Infinispan data
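For the Query DSL path, a sketch against the 7.x API (Person and its fields are stand-in domain types; the DSL also runs index-less over plain cached values):

// Native Query DSL sketch: find all cached Person entries with age >= 18.
import java.util.List;
import org.infinispan.Cache;
import org.infinispan.query.Search;
import org.infinispan.query.dsl.Query;
import org.infinispan.query.dsl.QueryFactory;

class Person {            // stand-in domain class for the example
    String name;
    int age;
}

public class QueryExample {
    public static List<?> adults(Cache<String, Person> cache) {
        QueryFactory qf = Search.getQueryFactory(cache);
        Query q = qf.from(Person.class)
                    .having("age").gte(18)
                    .toBuilder().build();
        return q.list();
    }
}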