Our techniques provide fast wavelet tree construction in practice based on recent theoretical work. Experiments on real datasets show our methods using the PEXT and PSHUFB CPU instructions outperform previous approaches. For wavelet trees, our methods are 1.9x faster than naive construction on average and competitive with state-of-the-art. For wavelet matrices, we achieve speedups of 1.1-1.9x over the state-of-the-art. This work provides the first practical implementation of the fastest known wavelet tree construction algorithms.
3. 3
Fast WT construction has attracted much attention!!
Papers                                                      Sequential (S) / Parallel (P)?   Impl.?
[Fuentes-Sepúlveda+, SEA’14]                                P                                Yes
[Shun+, DCC’15]                                             P                                Yes
[Labeit+, DCC’16]                                           P                                Yes
[Fischer+, ALENEX’18]                                       S+P                              Yes
[Munro+, SPIRE’14][Babenko+, SODA’15] (Best upper bound)    S                                No
[Shun+, DCC’17] (Best upper bound)                          P                                No
§ Gap between theory and practice:
• No practical implementation of [Munro+, SPIRE’14][Babenko+, SODA’15]: The current fastest wavelet tree construction algorithm.
4. 4
First* practical implementation of wavelet tree construction
based on [Munro+, SPIRE’14][Babenko+, SODA’15]
Main result
§ Our idea: Replace precomputed tables w/
• Special CPU instruction: PEXT (in BMI2) or PSHUFB (in SSSE3).
• Broadword computation (omitted).
§ Experiments on real datasets:
• Wavelet tree: ours were competitive with SOTA [Fischer+, ALENEX'18]
• Wavelet matrix: ours were 1.1–1.9x faster than SOTA.
*Szymon Grabowski kindly pointed out that Tuukka Norri also tried similar approaches:
github.com/tsnorri/wt-construct-gn
8. 8
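[Figure: example input of 16 integers with t = 4 bits each, shown with their binary codes:
6 (0110), 8 (1000), 9 (1001), 4 (0100), 14 (1110), 11 (1011), 1 (0001), 0 (0000), 5 (0101), 7 (0111), 12 (1100), 15 (1111), 13 (1101), 2 (0010), 3 (0011), 10 (1010).
The first eight elements, 6, 8, 9, 4, 14, 11, 1, 0, are shown again separately.]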
Fast construction of wavelet trees
§ Fast wavelet tree construction [Munro+, SPIRE’14][Babenko+, SODA’15]
• Process multiple elements at a time.
• w/t of t-bit elements can be read together.
§ Primitive operations:
• bitpack(X, i) = B
• listsplit(X, i) = (X0, X1)
Figure labels (illustrated for the 0th bit):
• X: First w/t elements in a word of w bits (e.g., w = 32 and t = 4).
• B: Packed 0th bits of elements contained in X.
• X0: A subarray consisting of X’s elements whose 0th bit is 0.
• X1: A subarray consisting of X’s elements whose 0th bit is 1.
Assume word length w = 32, input integer width t = 4, and thus w/t = 8 for explanation.
Assumption:
• The standard word RAM
• w: word length (in bits)
• t: input integer width (in bits)
• t ≤ w just for explanation. (This condition can be eliminated.)
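The two primitives can be modelled one element at a time, as in the Python sketch below. It is only a reference for what bitpack and listsplit return, not the word-parallel method of the cited papers. The function names, the list representation, and the helper `bit_msb_first` are illustrative assumptions. Bit 0 is assumed to be the most significant bit of each t-bit element, a convention chosen because it agrees with the bitpack example on the PEXT slide.

```python
from __future__ import annotations


def bit_msb_first(x: int, i: int, t: int) -> int:
    """Return the i-th bit of a t-bit integer, counting from the most significant bit (assumed convention)."""
    return (x >> (t - i - 1)) & 1


def bitpack(xs: list[int], i: int, t: int) -> list[int]:
    """Collect the i-th bit of every element, preserving order."""
    return [bit_msb_first(x, i, t) for x in xs]


def listsplit(xs: list[int], i: int, t: int) -> tuple[list[int], list[int]]:
    """Stably split xs into (elements whose i-th bit is 0, elements whose i-th bit is 1)."""
    x0 = [x for x in xs if bit_msb_first(x, i, t) == 0]
    x1 = [x for x in xs if bit_msb_first(x, i, t) == 1]
    return x0, x1


# First w/t = 8 elements of the example input, with t = 4.
X = [6, 8, 9, 4, 14, 11, 1, 0]
B = bitpack(X, 0, 4)
X0, X1 = listsplit(X, 0, 4)
```

Under these assumptions, B is 0 1 1 0 1 1 0 0, X0 is (6, 4, 1, 0), and X1 is (8, 9, 14, 11).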
9. 9
Main idea: Two special CPU instructions
Parallel bits EXTract
PEXT(X, Y) = Z
Packs bits in X according to Y
such that for all i,
bit(Z, i) = bit(X, j) holds.
• bit(a, i): i-th bit of a.
• select1(a, i): index of a’s i-th 1.
• j = select1(Y, i).
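The definition above can be checked against a small software model. The following Python sketch emulates the stated PEXT semantics bit by bit; it assumes index 0 is the least significant bit, as in the x86 instruction, and the function name `pext` is illustrative. Real code would call the hardware instruction, for example through the `_pext_u32` or `_pext_u64` intrinsics in C or C++.

```python
def pext(x: int, y: int) -> int:
    """Software model of PEXT: gather the bits of x selected by mask y into the low bits of the result."""
    z = 0
    i = 0  # position of the next output bit in z
    j = 0  # position currently examined in the mask y
    while y >> j:
        if (y >> j) & 1:
            # j is the index of the i-th set bit of y, i.e. j = select1(y, i).
            z |= ((x >> j) & 1) << i
            i += 1
        j += 1
    return z


# Extract the upper four bits of an 8-bit value.
upper = pext(0b10110010, 0b11110000)
# Extract every other bit, starting from bit 0.
even = pext(0b10110010, 0b01010101)
```

Here `upper` is 0b1011 and `even` is 0b0100.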
Packed SHUFfle Bytes
PSHUFB(X, Y) = Z
Permutes t-bit blocks in X according to Y
such that for all i,
block(Z, i) = block(X, j) holds.
• block(a, i): i-th t-bit block of a.
• j = block(Y, i).
• In practice, w = 64 and t = 8 are required.
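The block permutation can be modelled in the same way. This Python sketch follows the definition above for general w and t; the names `block` and `shuffle_blocks` are illustrative, and block 0 is assumed to be the least significant block. It deliberately omits one detail of the hardware PSHUFB instruction, which writes a zero byte when the top bit of the control byte is set.

```python
def block(a: int, i: int, t: int) -> int:
    """Return the i-th t-bit block of a, with block 0 as the least significant block (assumed convention)."""
    return (a >> (i * t)) & ((1 << t) - 1)


def shuffle_blocks(x: int, y: int, w: int, t: int) -> int:
    """Model of the slide's PSHUFB: block i of the result is block j of x, where j = block(y, i)."""
    z = 0
    for i in range(w // t):
        j = block(y, i, t)
        z |= block(x, j, t) << (i * t)
    return z


# w = 64, t = 8: byte k of x holds the value 0x10 + k.
x = 0x1716151413121110
reverse_control = 0x0001020304050607  # byte i of the control word holds 7 - i
reversed_x = shuffle_blocks(x, reverse_control, 64, 8)
broadcast_x = shuffle_blocks(x, 0, 64, 8)  # every control byte selects block 0
```

With this control word the bytes of x come out in reverse order, and an all-zero control word broadcasts byte 0 to every position.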
10. 10
PEXT-based technique for bitpack
§ Preprocessing:
• L: Packed array with block(L, i) = 1 for every i in [0, w/t)
(i.e., each t-bit block has 1 only at its lowest bit).
Assume word size w = 32, element size t = 4, and thus w/t = 8 for explanation.
Example: bitpack(X, i = 1), where X packs the elements 6 4 1 0 5 7 2 3.
1. Shift X by t-i-1 = 2:
   X          = 01100100000100000101011100100011
   Y = X >> 2 = 00011001000001000001010111001000
2. Perform PEXT(Y, L) = Z:
   Y = 00011001000001000001010111001000
   L = 00010001000100010001000100010001
   Z = 11001100000000000000000000000000
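The two steps can be replayed in software. The Python sketch below is a model under stated assumptions, not the authors' implementation: `pext` is a bit-by-bit emulation of the instruction, `low_bit_mask` and `bitpack_pext` are illustrative names, and the element bit index i is counted from the most significant bit of each element, which is what makes the shift amount t-i-1.

```python
def pext(x: int, y: int) -> int:
    """Software model of PEXT: gather the bits of x selected by mask y into the low bits of the result."""
    z = 0
    i = 0
    j = 0
    while y >> j:
        if (y >> j) & 1:
            z |= ((x >> j) & 1) << i
            i += 1
        j += 1
    return z


def low_bit_mask(w: int, t: int) -> int:
    """Build L: a w-bit word in which every t-bit block has a 1 only at its lowest bit."""
    mask = 0
    for k in range(w // t):
        mask |= 1 << (k * t)
    return mask


def bitpack_pext(x: int, i: int, w: int, t: int) -> int:
    """bitpack via shift and PEXT: move bit i of each block to the block's lowest bit, then gather."""
    y = x >> (t - i - 1)
    return pext(y, low_bit_mask(w, t))


# The example from the slide: elements 6 4 1 0 5 7 2 3 packed into one 32-bit word.
X = 0b0110_0100_0001_0000_0101_0111_0010_0011
Y = X >> 2
L = low_bit_mask(32, 4)
Z = bitpack_pext(X, 1, 32, 4)
```

In this model the eight packed bits 11001100 land in the low bits of Z, which is how the x86 instruction places its result; the slide appears to draw the same eight bits at the left end of the word.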
17. 17
Conclusion
§ Practical wavelet tree construction using PEXT/PSHUFB.
• Based on [Munro+, SPIRE’14] and [Babenko+, SODA’15]
§ Experiments on real datasets:
• Wavelet tree: Faster than NAÏVE and competitive w/ SOTA:
prefix sorting (PS) and prefix counting (PC).
• Wavelet matrix: Faster than NAÏVE, PS, and PC.
§ Future work
• Exploit more parallelism in CPU cores and/or SIMD registers.