We present the ED-Tree, a distributed pool structure based on a combination of the elimination-tree and diffracting-tree paradigms, allowing high degrees of parallelism with reduced contention
2. The Pool
Producer-consumer pools, that is, collections of
unordered objects or tasks, are a fundamental
element of modern multiprocessor software and a
target of extensive research and development
[Figure: producers P1, P2, ..., Pn call Put(x), Put(y), Put(z) on the pool; consumers C1, C2, ..., Cn call Get( ).]
3. ED-Tree Pool
We present the ED-Tree, a distributed pool
structure based on a combination of the
elimination-tree and diffracting-tree
paradigms, allowing high degrees of
parallelism with reduced contention
4. Java JDK6.0:
SynchronousQueue/Stack (Lea, Scott, and Shearer) - pairing-up function without buffering. Producers and consumers wait for one another.
LinkedBlockingQueue - Producers put their value and leave; consumers wait for a value to become available.
ConcurrentLinkedQueue - Producers put their value and leave; consumers return null if the pool is empty.
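The three behaviours listed above can be observed directly on the java.util.concurrent classes. The following is a minimal sketch: the class name `JdkPoolSemantics` and its method names are illustrative only, and the short timed poll stands in for a consumer that would otherwise wait indefinitely.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

final class JdkPoolSemantics {

    // SynchronousQueue has no buffer: a non-blocking offer fails unless a consumer is already waiting.
    static boolean synchronousOfferWithoutConsumer() {
        return new SynchronousQueue<String>().offer("x");
    }

    // LinkedBlockingQueue: the producer deposits its value and leaves at once.
    static int linkedBlockingSizeAfterOffer() {
        LinkedBlockingQueue<String> q = new LinkedBlockingQueue<>();
        q.offer("y");
        return q.size();
    }

    // LinkedBlockingQueue: a consumer on an empty queue waits; the timed poll bounds that wait.
    static String linkedBlockingTimedPollOnEmpty() {
        try {
            return new LinkedBlockingQueue<String>().poll(10, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    // ConcurrentLinkedQueue: a consumer never waits; poll() returns null when the pool is empty.
    static String concurrentPollOnEmpty() {
        return new ConcurrentLinkedQueue<String>().poll();
    }
}
```

In all three cases the requests go through a single shared structure, which is the property the next slide identifies as the drawback.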
5. Drawback
All these structures are based on a centralized
structure such as a lock-free queue or stack,
and thus are limited in their scalability: the
head of the stack or queue is a sequential
bottleneck and a source of contention.
6. Some Observations
A pool does not have to obey either LIFO or FIFO semantics.
Therefore, no centralized structure is needed to hold the items and to serve producers' and consumers' requests.
7. New approach
ED-Tree: a combined variant of
the diffracting-tree structure (Shavit and Zemach) and
the elimination-tree structure (Shavit and Touitou)
The basic idea:
Use randomization to distribute the concurrent
requests of threads onto many locations so that they
collide with one another and can exchange values,
thus avoiding using a central place through which all
threads pass.
The result:
A pool that allows both parallelism and reduced
contention.
8. A little history
Both diffraction and elimination were presented years ago, and were claimed to be effective on the basis of simulation.
However, elimination trees and diffracting trees were never used to implement real-world structures.
Elimination and diffraction were never combined in a single data structure.
9. Diffraction trees
A binary tree of objects called balancers [Aspnes-Herlihy-Shavit] with
a single input wire and two output wires
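[Figure: a balancer b receives tokens 1 to 5 on its input wire; tokens 1, 3, 5 leave on the top output wire and tokens 2, 4 on the bottom output wire.]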
Threads arrive at a balancer, which sends them alternately to its top and bottom output wires,
so the top wire always carries at most one more thread than the bottom one.
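A balancer of this kind can be sketched as a single atomically flipped bit. This is a minimal illustration, not the presentation's code: the class name `ToggleBalancer` and the 0 = top, 1 = bottom encoding are assumptions.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class ToggleBalancer {
    // false: the next thread exits on the top wire; true: it exits on the bottom wire.
    private final AtomicBoolean toggle = new AtomicBoolean(false);

    /** Flips the bit atomically and returns the wire chosen: 0 = top, 1 = bottom. */
    int traverse() {
        while (true) {
            boolean old = toggle.get();
            if (toggle.compareAndSet(old, !old)) {
                return old ? 1 : 0;
            }
            // Another thread flipped the bit first; retry with the new value.
        }
    }
}
```

Five sequential calls return 0, 1, 0, 1, 0, so three threads leave on the top wire and two on the bottom. Every thread performs a compare-and-set on the same bit, which is why the toggle itself becomes a point of contention under load.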
11. Diffraction trees
Connect each output wire to a lock free queue
[Figure: a binary tree of seven balancers b; each output wire of the last level leads to a lock-free queue.]
To perform a push, threads traverse the balancers from the root to the leaves and
then push the item onto the appropriate queue.
To perform a pop, threads traverse the balancers from the root to the leaves and
then pop from the appropriate queue, blocking if the queue is empty.
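The push and pop traversals can be sketched as follows, without diffraction or elimination. The sketch makes several assumptions that are not on the slide: the class name `DiffractingTreePool`, a complete tree stored in heap order, `LinkedBlockingQueue` at the leaves, and one toggle per request type at each balancer so that the k-th pop follows the same path as the k-th push.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

final class DiffractingTreePool<T> {
    private final int depth;
    private final AtomicBoolean[] putToggles;   // one per balancer, heap order (root = 0)
    private final AtomicBoolean[] getToggles;
    private final List<BlockingQueue<T>> leaves = new ArrayList<>();

    DiffractingTreePool(int depth) {
        this.depth = depth;
        int balancers = (1 << depth) - 1;
        putToggles = new AtomicBoolean[balancers];
        getToggles = new AtomicBoolean[balancers];
        for (int i = 0; i < balancers; i++) {
            putToggles[i] = new AtomicBoolean();
            getToggles[i] = new AtomicBoolean();
        }
        for (int i = 0; i < (1 << depth); i++) {
            leaves.add(new LinkedBlockingQueue<T>());
        }
    }

    // Flip the bit atomically; the old value picks the direction (false = left, true = right).
    private static boolean flip(AtomicBoolean bit) {
        while (true) {
            boolean old = bit.get();
            if (bit.compareAndSet(old, !old)) {
                return old;
            }
        }
    }

    // Walk from the root to a leaf and return the index of the queue reached.
    private int route(AtomicBoolean[] toggles) {
        int node = 0;
        for (int level = 0; level < depth; level++) {
            node = 2 * node + (flip(toggles[node]) ? 2 : 1);
        }
        return node - ((1 << depth) - 1);
    }

    void put(T item) {
        leaves.get(route(putToggles)).add(item);
    }

    /** Blocks while the queue reached is empty. */
    T take() {
        try {
            return leaves.get(route(getToggles)).take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for an item", e);
        }
    }

    int leafSize(int leaf) {
        return leaves.get(leaf).size();
    }
}
```

With `depth = 3` the tree has seven balancers and eight queues, and eight consecutive puts leave exactly one item in each queue.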
13. Diffraction trees
Observation:
If an even number of threads pass through a balancer, the
outputs are evenly balanced on the top and bottom wires, but
the balancer's state remains unchanged
The approach:
Add a diffraction array in front of each toggle bit
[Figure: a prism array placed in front of the 0/1 toggle bit.]
14. Elimination
At any point while traversing the tree, if a producer and a consumer collide, there is no need for them to diffract and continue traversing the tree.
The producer can hand its item to the consumer, and both can leave the tree.
16. Using elimination-diffraction balancers
Let the array at each balancer be a diffraction-elimination array:
If two producer (or two consumer) threads meet in the array, they leave on opposite wires without touching the toggle bit, since it would end up in its original state anyway.
If a producer and a consumer meet, they eliminate, exchanging items.
If a producer or consumer call does not manage to meet another in the array, it toggles the respective bit of the balancer and moves on.
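The three cases above amount to a small decision table. The sketch below captures only that table; how two threads actually meet in the array is left out. The names `EdBalancerRules`, `Kind`, `Action`, and the `arrivedFirst` tie-breaker (used so that two same-type threads pick opposite wires) are hypothetical.

```java
final class EdBalancerRules {
    enum Kind { PRODUCER, CONSUMER }
    enum Action { ELIMINATE, EXIT_LEFT, EXIT_RIGHT, USE_TOGGLE }

    /**
     * Decides what a thread does after its visit to the array.
     * partner is null when the thread timed out without meeting anyone.
     * arrivedFirst is an assumed tie-breaker: of two same-type threads,
     * the one that reached the slot first goes left, the other goes right.
     */
    static Action decide(Kind mine, Kind partner, boolean arrivedFirst) {
        if (partner == null) {
            return Action.USE_TOGGLE;   // no collision: fall back to the toggle bit
        }
        if (partner != mine) {
            return Action.ELIMINATE;    // producer meets consumer: exchange and leave the tree
        }
        return arrivedFirst ? Action.EXIT_LEFT : Action.EXIT_RIGHT; // same type: diffract
    }
}
```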
18. What about low concurrency levels?
We show that elimination and diffraction techniques can be combined to work well at both high and low loads.
To ensure good performance at low loads we use several techniques that make the algorithm adapt to the current contention level.
19. Adaptation mechanisms
Use backoff in space:
Randomly choose a cell in a certain range of the array.
If the cell is busy (already occupied by two threads), increase the range and repeat.
Else spin and wait for a collision.
If timed out (no collision), decrease the range and repeat.
If a certain number of timeouts is reached, spin on the first cell of the array for a period, and then move on to the toggle bit and the next level.
If a certain number of timeouts was reached, don't try to diffract on any of the next levels; just go straight to the toggle bit.
Each thread remembers the last range it used at the current balancer and starts from this range next time.
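The range adaptation can be sketched as a small per-thread object. The slides only say that the range is increased or decreased; doubling, halving, the initial range of 1, and the names `SlotRange`, `onBusy`, and `onTimeout` are assumptions made for this illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

final class SlotRange {
    private final int arrayLength;
    private int range = 1;   // cells [0, range) are candidates; kept per thread and per balancer

    SlotRange(int arrayLength) {
        this.arrayLength = arrayLength;
    }

    /** Randomly choose a cell within the current range. */
    int pickSlot() {
        return ThreadLocalRandom.current().nextInt(range);
    }

    /** The chosen cell was busy: spread out over more cells (assumed policy: double). */
    void onBusy() {
        range = Math.min(arrayLength, range * 2);
    }

    /** Nobody showed up before the timeout: concentrate on fewer cells (assumed policy: halve). */
    void onTimeout() {
        range = Math.max(1, range / 2);
    }

    int range() {
        return range;
    }
}
```

Under high contention the range grows toward the full array, while under low contention it shrinks toward the first cell, where the few threads present are most likely to meet.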
20. Starvation avoidance
Threads that failed to eliminate and propagated all the way to the leaves can wait a long time for their requests to complete, while new threads entering the tree and eliminating finish faster.
To avoid starvation, we limit the time a thread can be blocked in the queues before it retries the whole traversal.
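For a consumer, the bounded wait can be sketched with a timed poll on the leaf queue. This is a sketch under stated assumptions: `BoundedWaitGet` is a hypothetical name, and the `traversal` supplier stands in for one root-to-leaf pass that returns the queue reached (any elimination on the way down is omitted).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class BoundedWaitGet {
    /**
     * Repeats the traversal until an item is obtained, waiting at most
     * maxWaitMillis at each leaf before starting again from the root.
     */
    static <T> T get(Supplier<BlockingQueue<T>> traversal, long maxWaitMillis) {
        while (true) {
            BlockingQueue<T> leaf = traversal.get();
            try {
                T item = leaf.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
                if (item != null) {
                    return item;
                }
                // Timed out at this leaf: loop and traverse the tree again.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for an item", e);
            }
        }
    }
}
```

A retrying thread gets a fresh chance to eliminate near the root instead of staying parked behind an empty leaf queue.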
21. Implementation
Each balancer is composed of an elimination array, a pair of toggle bits, and two references, one to each of its child nodes.
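public class Balancer
{
    // ToggleBit and Exchanger are the presentation's own classes; their code is not shown on the slides.
    ToggleBit producerToggle, consumerToggle;   // one toggle bit per request type
    Exchanger[] eliminationArray;               // cells in which threads try to collide
    Balancer leftChild, rightChild;             // the two child nodes
    ThreadLocal<Integer> lastSlotRange;         // backoff range each thread last used at this balancer
}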
23. Implementation
Starting from the root of the tree:
Enter the balancer.
Choose a cell in the array and try to collide with another thread, using the backoff mechanism described earlier.
If a collision with another thread occurred:
If both threads are of the same type, they leave to the next-level balancers (each in a separate direction).
If the threads are of different types, they exchange values and leave.
Else (no collision), use the appropriate toggle bit and move to the next level.
If a leaf is reached, go to the appropriate queue and insert/remove an item according to the thread type.
24. Performance evaluation
Sun UltraSPARC T2 Plus multi-core machine.
2 processors, each with 8 cores, each core with 8 hardware threads.
64-way parallelism on a processor and 128-way parallelism across the machine.
Most of the tests were done on one processor, i.e. at most 64 hardware threads.
25. Performance evaluation
A tree with 3 levels and 8 queues
The queues are
SynchronousBlocking/LinkedBlocking/ConcurrentLinked,
according to the pool specification
[Figure: the three-level tree of seven balancers b, with a queue on each of the eight leaf wires.]
The theme is building a data structure that is used as a pool, making it scalable and usable for high loads, and no less usable than existing implementations for low loads.
What is a pool? A collection of items, which may be objects or tasks. Examples: a resource pool (objects that are used and then returned to the pool), a pool of jobs to perform, etc.
The pool is approached by producers and consumers, which perform Put/Get (Push/Pop, Enqueue/Dequeue) actions.
These actions can implement different semantics and be blocking or non-blocking, depending on how the pool was defined (explain blocking vs. non-blocking).
The data structure we present is called the ED-Tree. It is a highly scalable pool to be used in multithreaded applications. We reach high performance and scalability by combining two paradigms: elimination and diffraction.
The ED-Tree is implemented in Java.
If we look in the Java JDK for data structures that can be used as a pool, we will find the following…
All the mentioned data structures are problematic… They are based on centralized structures… The head or tail of the queue/stack becomes a hot spot, and with a large number of threads performance gets worse instead of improving.
If we think about it, we don't care about the order in which the items are inserted into or removed from the pool. All we want is to avoid starvation (if an item is inserted into the pool, eventually it will be removed).
Therefore we can avoid using a centralized structure and distribute the pool in memory.
A single level of an elimination array was also used in implementing shared concurrent stacks. However, elimination trees and diffracting trees were never used to implement real-world structures. This is mostly due to the fact that there was no need for them: machines with a sufficient level of concurrency and low enough interconnect latency to benefit from them did not exist. Today, multi-core machines present the necessary combination of high levels of parallelism and low interconnection costs. Indeed, this paper is the first to show that ED-Tree based implementations of data structures from java.util.concurrent scale impressively on a real machine (a Sun Maramba multicore machine with 2x8 cores and 128 hardware threads), delivering throughput that, at high concurrency levels, is 10 times that of the newly proposed JDK6.0 algorithms.
A balancer is usually implemented as a toggle bit: a bit that holds a binary value. Each thread changes the value to the opposite one and picks a direction to exit according to the bit value. For example: 0, go left; 1, go right.
The diffraction tree is constructed from a set of balancers… You can say that the tree counts the elements, i.e. distributes them equally across the leaves…
If we connect a lock-free queue/stack to each leaf and use two toggle bits in each balancer, we get a data structure that obeys pool semantics…
We can see that we just moved our contention source from a single queue/stack to the balancers, starting from the entrance to the tree.
The problem is solved by diffraction… What we get eventually is that each thread that approaches the pool traverses the whole tree and eventually reaches one of the queues at the leaves.
Actually, if at some point during the tree traversal a producer thread and a consumer thread meet, they don't have to continue traversing the tree. The consumer can take the producer's value, and they both can leave the tree.
At high loads, according to our statistics, 50% of the threads arriving at each level are successfully eliminated there. I.e. in a 3-level tree, 50% of all requests are eliminated at the first level, another 25% at the second, and 12.5% at the third, meaning only 12.5% (about one in eight) of the requests survive to reach the leaves.
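The arithmetic behind this estimate can be written out as follows, assuming a constant per-level elimination probability p (the notes report roughly p = 1/2 at high load) and a tree of depth d; the symbols p, d, and k are introduced here for illustration only.

```latex
\Pr[\text{eliminated at level } k] = (1-p)^{k-1}\,p, \qquad
\Pr[\text{reaches a leaf}] = (1-p)^{d}

p = \tfrac12,\ d = 3:\quad
\tfrac12 + \tfrac14 + \tfrac18 = \tfrac78 = 87.5\%\ \text{eliminated}, \qquad
\left(\tfrac12\right)^{3} = \tfrac18 = 12.5\%\ \text{reach the leaves}
```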
We also use two toggle bits at each balancer, one for producers and one for consumers, to ensure fair distribution.
In the described implementation, another problem we can encounter is starvation…
Each balancer is composed of an EliminationArray, a pair of toggle bits, and two references, one to each of its child nodes.
The implementation of an eliminationArray is based on an array of Exchangers. Each exchanger contains a single AtomicReference which is used as an Atomic placeholder for exchanging ExchangerPackage, where the ExchangerPackage is an object used to wrap the actual data and to mark its state and type.
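A single cell of this kind might look like the sketch below. Only the `AtomicReference` placeholder and the `ExchangerPackage` wrapper come from the notes; the class name `ExchangerSlot`, the `partner` field used as the state marker, and the three method names are assumptions, and the presentation's actual `Exchanger` class may differ.

```java
import java.util.concurrent.atomic.AtomicReference;

final class ExchangerSlot {
    enum Kind { PRODUCER, CONSUMER }

    /** Wraps the data and records the request type and, once matched, the partner. */
    static final class ExchangerPackage {
        final Kind kind;
        final Object value;                  // the item for a producer, null for a consumer
        volatile ExchangerPackage partner;   // set by the thread that claims this package

        ExchangerPackage(Kind kind, Object value) {
            this.kind = kind;
            this.value = value;
        }
    }

    private final AtomicReference<ExchangerPackage> slot = new AtomicReference<>();

    /** First arrival: install our package if the cell is empty. */
    boolean tryPublish(ExchangerPackage mine) {
        return slot.compareAndSet(null, mine);
    }

    /** Second arrival: atomically take the waiting package, if any, and tell it who we are. */
    ExchangerPackage claim(ExchangerPackage mine) {
        ExchangerPackage waiting = slot.getAndSet(null);
        if (waiting != null) {
            waiting.partner = mine;
        }
        return waiting;
    }

    /** Timed-out waiter: remove our package; false means a partner has already claimed it. */
    boolean withdraw(ExchangerPackage mine) {
        return slot.compareAndSet(mine, null);
    }
}
```

A waiting thread would spin on its own `partner` field. If `withdraw` returns false, a partner has already claimed the package, and the waiter should keep reading `partner` until it becomes non-null. Comparing the two `kind` values then tells the pair whether to eliminate or diffract.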
At its peak at 64 threads the ED-Tree delivers more than 10 times the performance of the JDK.
Beyond 64 threads the threads are no longer bound to a single CPU, and traffic across the interconnect causes a moderate performance decline for the ED-Tree version (the performance of the JDK is already very low).