AA-sort with SSE4.1


              Cybozu Labs
  2012/6/16 MITSUNARI Shigeo(@herumi)
x86/x64 optimization seminar 4(#x86opti)
Agenda
 • Introduction to AA-sort
   - classic combsort
   - vectorized combsort
   - vectorized merge
 • benchmark




2012/6/16 #x86opti 4        2 /29
AA-sort
 • Aligned-Access sort
   - proposed by Hiroshi Inoue et al. in
    "A high-performance sorting algorithm for multicore
    single-instruction multiple-data processors," 2011
     - http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm
     - http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm
   - For SIMD
     - fewer conditional branches, no unaligned data accesses
   - For multicore processors
     - they implemented it for PowerPC and Cell BE
   - O(n log n) complexity
 • I tried it on Intel CPUs (not complete)
   - https://github.com/herumi/opti/blob/master/intsort.hpp
     - the current version uses only one processor
AA-sort
 • vectorized combsort for each block (<= L2 cache?)
 • vectorized merge of the sorted blocks

   [Figure: the input array is split into block 0, block 1, block 2, block 3, ...;
    each block is sorted, neighbouring sorted blocks are then merged, and the
    merged runs are merged again]
AA-sort algorithm
 • sort each block
   - O(n log n)
 • merge the sorted blocks
   - O(n) per merge pass




classic combsort(1/2)
 • improved bubble sort
   - unstable
   - O(n log n)
   - compare two elements separated by a gap (>= 1)
     - the gap is divided by the shrink factor (about 1.3) after each pass
     size_t nextGap(size_t N) { return (N * 10) / 13; }

     void combsort(uint32_t *a, size_t N) {
       size_t gap = nextGap(N);
       while (gap > 1) {
         for (size_t i = 0; i < N - gap; i++) {
           if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
         }
         gap = nextGap(gap);
       }
       …
classic combsort(2/2)
 • gap = 1 means bubble sort
   - loop until the array is fully sorted

           …
           for (;;) {
             bool isSwapped = false;
             for (size_t i = 0; i < N - 1; i++) {
               if (a[i] > a[i + 1]) {
                 std::swap(a[i], a[i + 1]);
                 isSwapped = true;
               }
             }
             if (!isSwapped) return;
           }
       }
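The two code fragments above can be joined into one function. The following is a minimal sketch of that, not the author's exact source: the `N < 2` guard and the `combsorted` wrapper are additions made for this example.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <utility>
#include <vector>

// Shrink the gap by a factor of about 1.3 (same formula as on the slide).
inline size_t nextGap(size_t n) { return (n * 10) / 13; }

// Classic combsort: gapped passes first, then bubble-sort passes (gap == 1).
inline void combsort(uint32_t *a, size_t N) {
  if (N < 2) return;  // added guard: N - 1 would wrap around for N == 0
  size_t gap = nextGap(N);
  while (gap > 1) {
    for (size_t i = 0; i + gap < N; i++) {
      if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
    }
    gap = nextGap(gap);
  }
  for (;;) {
    bool isSwapped = false;
    for (size_t i = 0; i + 1 < N; i++) {
      if (a[i] > a[i + 1]) {
        std::swap(a[i], a[i + 1]);
        isSwapped = true;
      }
    }
    if (!isSwapped) return;  // a full pass without swaps: sorted
  }
}

// Hypothetical convenience wrapper used only in this example.
inline std::vector<uint32_t> combsorted(std::vector<uint32_t> v) {
  combsort(v.data(), v.size());
  return v;
}
```

Calling `combsorted` on a vector returns a sorted copy, which makes the function easy to compare against `std::sort`.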


gap function
 • Combsort11
   - ending the gap sequence with [11, 8, 6, 4, 3, 2, 1] seems to work well
    according to http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm


      size_t nextGap(size_t n) {
          n = (n * 10) / 13;
          if (n == 9 || n == 10) return 11; // (*)
          return n;
      }



   - slightly faster when line (*) is added
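To see the effect of line (*), the gaps can simply be listed. This small sketch reuses `nextGap` from the slide; `gapSequence` is a hypothetical helper written for this example.

```cpp
#include <cassert>
#include <stddef.h>
#include <vector>

// Combsort11 gap function from the slide: gaps 9 and 10 are bumped to 11.
inline size_t nextGap(size_t n) {
  n = (n * 10) / 13;
  if (n == 9 || n == 10) return 11;  // (*)
  return n;
}

// Hypothetical helper: every gap used for an array of N elements,
// ending with the final gap of 1.
inline std::vector<size_t> gapSequence(size_t N) {
  std::vector<size_t> gaps;
  for (size_t gap = nextGap(N); gap >= 1; gap = nextGap(gap)) {
    gaps.push_back(gap);
    if (gap == 1) break;
  }
  return gaps;
}
```

For N = 100 this gives 76, 58, 44, 33, 25, 19, 14, 11, 8, 6, 4, 3, 2, 1, so the sequence ends with the pattern quoted above.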



vectorized combsort
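 • step1 : sort values within each vector (32bit x 4)
 • step2 : SIMD version of combsort
 • step3 : reorder data

   [Figure: the input array 6 8 9 3 5 7 12 14 0 4 1 20 11 ... is viewed as
    vectors v0, v1, v2, v3, ... of four elements (+0 to +3) each.
    step1 sorts the values within each vector. step2 sorts across the vectors,
    which leaves the data sorted in transposed order
    (+0: 0 1 3 ... 101, +1: 102 104 105 ... 380, +2: 389 391 392 ... 502,
    +3: 511 515 612 ... 973). step3 reorders this into
    0 1 3 ... 101 102 104 105 ... 380 389 391 392 ...]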
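The slides give no code for step3. As a rough sketch, assuming the transposed layout shown in the figure (lane +j of all vectors holds the j-th quarter of the sorted data), the reordering could be modelled out of place as below. `Vec4` is a portable stand-in for the four-lane vector type and `reorderStep3` is a hypothetical name; a real implementation would more likely work with SIMD loads and stores.

```cpp
#include <array>
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Portable stand-in for a vector of four 32-bit lanes.
using Vec4 = std::array<uint32_t, 4>;

// Reorder from transposed order into a plain array:
// lane j of vector i goes to position j * N + i.
inline std::vector<uint32_t> reorderStep3(const std::vector<Vec4> &va) {
  const size_t N = va.size();
  std::vector<uint32_t> out(4 * N);
  for (size_t i = 0; i < N; i++) {
    for (size_t j = 0; j < 4; j++) out[j * N + i] = va[i][j];
  }
  return out;
}
```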

step1
 • step1.1 : sort [v[i][j] | i<-[0..3]] for j = 0, 1, 2, 3
 • step1.2 : transpose

   (columns are the vectors v0..v3, rows are the lanes)

    input               step1.1 (sort)      step1.2 (transpose)
    v0  v1  v2  v3      v0  v1  v2  v3      v0  v1  v2  v3
     3   5   0   8       0   3   5   8       0   1   4   9
     2   7   1   2       1   2   2   7       3   2   8  14
     8  12   4  13       4   8  12  13       5   2  12  15
     9  14  20  15       9  14  15  20       8   7  13  20

sort of 4 items
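 • use pmaxud, pminud for uint32_t x 4
   - a compare-and-swap turns (a, b) into (min(a,b), max(a,b))
   - sorting network for v0, v1, v2, v3:
       min01 = min(v0, v1)          max01 = max(v0, v1)
       min23 = min(v2, v3)          max23 = max(v2, v3)
       min0123 = min(min01, min23)  max0123 = max(max01, max23)
       s = max(min01, min23)        t = min(max01, max23)
       sorted: min0123, min(s,t), max(s,t), max0123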

source of step1.1
 • V128 is a vector type of 32-bit integer x 4
   - pminud(a, b) : min(a_i, b_i) for i = 0, 1, 2, 3

                 void sort_step1_vec(V128 x[4])
                 {
                     V128 min01 = pminud(x[0], x[1]);
                     V128 max01 = pmaxud(x[0], x[1]);
                     V128 min23 = pminud(x[2], x[3]);
                     V128 max23 = pmaxud(x[2], x[3]);
                     x[0] = pminud(min01, min23);
                     x[3] = pmaxud(max01, max23);
                     V128 s = pmaxud(min01, min23);
                     V128 t = pminud(max01, max23);
                     x[1] = pminud(s, t);
                     x[2] = pmaxud(s, t);
                 }


transpose of 4x4 matrix
 • use unpcklps and unpckhps

   (columns are the vectors, rows are the lanes +0..+3)

    input               after round 1       after round 2
    x0  x1  x2  x3      t0  t1  t2  t3      x0  x1  x2  x3
     3   5   0   8       3   5   8  12       3   2   8   9
     2   7   1   2       0   8   4  13       5   7  12  14
     8  12   4  13       2   7   9  14       0   1   4  20
     9  14  20  15       1   2  20  15       8   2  13  15

   round 1: t0=unpcklps(x0,x2)  t1=unpcklps(x1,x3)
            t2=unpckhps(x0,x2)  t3=unpckhps(x1,x3)
   round 2: x0=unpcklps(t0,t1)  x1=unpckhps(t0,t1)
            x2=unpcklps(t2,t3)  x3=unpckhps(t2,t3)

source of transpose and step1
  void transpose(V128 x[4])
  {
    V128 x0 = x[0];
    V128 x1 = x[1];
    V128 x2 = x[2];
    V128 x3 = x[3];
    V128 t0 = unpcklps(x0, x2);
    V128 t1 = unpcklps(x1, x3);
    V128 t2 = unpckhps(x0, x2);
    V128 t3 = unpckhps(x1, x3);
    x[0] = unpcklps(t0, t1);
    x[1] = unpckhps(t0, t1);
    x[2] = unpcklps(t2, t3);
    x[3] = unpckhps(t2, t3);
  }

  void sort_step1(V128 *va, size_t N)
  {
    for (size_t i = 0; i < N; i += 4) {
      sort_step1_vec(&va[i]);
      transpose(&va[i]);
    }
  }
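Step1 can be checked against the numbers on the earlier slides without any SIMD hardware. The sketch below is a portable scalar model, not the SSE4.1 code itself: `Vec4`, `vmin`, `vmax`, `unpacklo` and `unpackhi` are stand-ins written for this example that imitate V128, pminud, pmaxud, unpcklps and unpckhps lane by lane.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stdint.h>

// Portable stand-in for V128: four 32-bit lanes.
using Vec4 = std::array<uint32_t, 4>;

// Lane-wise unsigned min / max, modelling pminud / pmaxud.
inline Vec4 vmin(const Vec4 &a, const Vec4 &b) {
  Vec4 r;
  for (int i = 0; i < 4; i++) r[i] = std::min(a[i], b[i]);
  return r;
}
inline Vec4 vmax(const Vec4 &a, const Vec4 &b) {
  Vec4 r;
  for (int i = 0; i < 4; i++) r[i] = std::max(a[i], b[i]);
  return r;
}

// Models of unpcklps / unpckhps: interleave the low / high halves.
inline Vec4 unpacklo(const Vec4 &a, const Vec4 &b) { return Vec4{{a[0], b[0], a[1], b[1]}}; }
inline Vec4 unpackhi(const Vec4 &a, const Vec4 &b) { return Vec4{{a[2], b[2], a[3], b[3]}}; }

// step1.1: sort lane j of x[0..3] for every j with a 4-input sorting network.
inline void sortStep1Vec(Vec4 x[4]) {
  Vec4 min01 = vmin(x[0], x[1]), max01 = vmax(x[0], x[1]);
  Vec4 min23 = vmin(x[2], x[3]), max23 = vmax(x[2], x[3]);
  Vec4 s = vmax(min01, min23), t = vmin(max01, max23);
  x[0] = vmin(min01, min23);
  x[3] = vmax(max01, max23);
  x[1] = vmin(s, t);
  x[2] = vmax(s, t);
}

// step1.2: 4x4 transpose using two rounds of unpack operations.
inline void transpose(Vec4 x[4]) {
  Vec4 t0 = unpacklo(x[0], x[2]), t1 = unpacklo(x[1], x[3]);
  Vec4 t2 = unpackhi(x[0], x[2]), t3 = unpackhi(x[1], x[3]);
  x[0] = unpacklo(t0, t1);
  x[1] = unpackhi(t0, t1);
  x[2] = unpacklo(t2, t3);
  x[3] = unpackhi(t2, t3);
}

// Hypothetical helper: run step1 on the example matrix from the slides.
inline std::array<Vec4, 4> step1Example() {
  std::array<Vec4, 4> v = {{Vec4{{3, 2, 8, 9}}, Vec4{{5, 7, 12, 14}},
                            Vec4{{0, 1, 4, 20}}, Vec4{{8, 2, 13, 15}}}};
  sortStep1Vec(v.data());
  transpose(v.data());
  return v;
}
```

After both steps each vector is sorted internally: v0 = (0, 3, 5, 8), v1 = (1, 2, 2, 7), v2 = (4, 8, 12, 13), v3 = (9, 14, 15, 20), which matches the step1 figure.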



SIMD version combsort
 • the first half of the code uses
   - vector_cmpswap
   - vector_cmpswap_skew
     bool sort_step2(V128 *va, size_t N) {
       size_t gap = nextGap(N);
       while (gap > 1) {
         for (size_t i = 0; i < N - gap; i++) {
           vector_cmpswap(va[i], va[i + gap]);
         }
         for (size_t i = N - gap; i < N; i++) {
           vector_cmpswap_skew(va[i], va[i + gap - N]);
         }
         gap = nextGap(gap);
       }
       ...


vector_cmpswap
 • no conditional branch
   - (a, b) becomes (min(a,b), max(a,b))
   - scalar version:

     if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);

   - vectorised version:

     void vector_cmpswap(V128& a, V128& b)
     {
       V128 t = pmaxud(a, b);
       a = pminud(a, b);
       b = t;
     }

vector_cmpswap_skew
 • for the boundary of the array

       a  = [a3 : a2 : a1 : a0]
       b  = [b3 : b2 : b1 : b0]

       (a', b') = vector_cmpswap_skew(a, b)

       a' = [a3 : min(a2,b3) : min(a1,b2) : min(a0,b1)]
       b' = [max(a2,b3) : max(a1,b2) : max(a0,b1) : b0]
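The slides do not list the source of vector_cmpswap_skew. The following scalar model only reproduces the lane pairing shown above, with `Vec4` as a portable stand-in for V128 (index 0 is lane a0). The actual SSE4.1 version presumably uses shifts or shuffles instead of a loop.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stdint.h>

// Portable stand-in for V128; index 0 is lane a0, index 3 is lane a3.
using Vec4 = std::array<uint32_t, 4>;

// Compare a[i] with b[i + 1] for i = 0, 1, 2.
// a[3] and b[0] have no partner and are left untouched.
inline void vectorCmpswapSkew(Vec4 &a, Vec4 &b) {
  for (int i = 0; i < 3; i++) {
    uint32_t lo = std::min(a[i], b[i + 1]);
    uint32_t hi = std::max(a[i], b[i + 1]);
    a[i] = lo;
    b[i + 1] = hi;
  }
}
```

For a = (10, 20, 30, 40) and b = (1, 5, 25, 50), written from lane 0 to lane 3, only the pair (a0, b1) = (10, 5) is out of order, so the result is a = (5, 20, 30, 40) and b = (1, 10, 25, 50).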




isSortedVec
 • check whether the array is sorted
   - ptest_zf(a, b) is true if (a & b) == 0
   - a <= b <=> max(a,b) == b <=> c := max(a,b) - b == 0
   - pcmpgtd is for int32_t, so we can't use it for uint32_t
          bool isSortedVec(const V128 *va, size_t N) {
            for (size_t i = 0; i < N - 1; i++) {
              V128 a = va[i];
              V128 b = va[i + 1];
              V128 c = pmaxud(a, b);
              c = psubd(c, b);
              if (!ptest_zf(c, c)) {
                return false;
              }
            }
            return true;
          }
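The reason for the max-and-subtract test shows up with a single pair of values. This scalar sketch uses hypothetical helper names; the cast to `int32_t` assumes the usual two's-complement behaviour and mimics what a signed per-lane compare would see.

```cpp
#include <algorithm>
#include <cassert>
#include <stdint.h>

// Unsigned a <= b, tested as on the slide: max(a, b) - b == 0.
inline bool lessEqViaMax(uint32_t a, uint32_t b) {
  return std::max(a, b) - b == 0;
}

// What a signed greater-than compare would conclude for the same bits
// (assumes two's-complement conversion to int32_t).
inline bool lessEqViaSignedCompare(uint32_t a, uint32_t b) {
  return !(static_cast<int32_t>(a) > static_cast<int32_t>(b));
}
```

With a = 1 and b = 0x80000000 the unsigned test correctly reports a <= b, while the signed compare treats b as negative and reports the opposite.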
loop for gap == 1
 • vectorised bubble sort for gap == 1
   - give up if the loop count reaches maxLoop
     - fall back to std::sort
         - this rarely happens
            const int maxLoop = 10;
            for (int loop = 0; loop < maxLoop; loop++) {
              for (size_t i = 0; i < N - 1; i++) {
                vector_cmpswap(va[i], va[i + 1]);
              }
              vector_cmpswap_skew(va[N - 1], va[0]);
              if (isSortedVec(va, N)) return true;
            }
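Putting the pieces of step2 together, a portable scalar model might look like the sketch below. It is an illustration under assumptions, not the author's code: `Vec4` stands in for V128, the helper names are made up for this example, and the gap == 1 phase simply loops until the data is sorted instead of giving up after maxLoop passes.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

using Vec4 = std::array<uint32_t, 4>;  // stand-in for V128

inline size_t nextGap(size_t n) { return (n * 10) / 13; }

// Lane-wise compare-and-swap: a gets the minima, b the maxima.
inline void vectorCmpswap(Vec4 &a, Vec4 &b) {
  for (int j = 0; j < 4; j++) {
    if (a[j] > b[j]) std::swap(a[j], b[j]);
  }
}

// Skewed variant: compare lane j of a with lane j + 1 of b.
inline void vectorCmpswapSkew(Vec4 &a, Vec4 &b) {
  for (int j = 0; j < 3; j++) {
    if (a[j] > b[j + 1]) std::swap(a[j], b[j + 1]);
  }
}

// True if the data is sorted in transposed order:
// lane 0 of all vectors, then lane 1 of all vectors, and so on.
inline bool isSortedTransposed(const std::vector<Vec4> &va) {
  const size_t N = va.size();
  if (N == 0) return true;
  for (size_t i = 0; i + 1 < N; i++) {
    for (int j = 0; j < 4; j++) {
      if (va[i][j] > va[i + 1][j]) return false;
    }
  }
  for (int j = 0; j < 3; j++) {
    if (va[N - 1][j] > va[0][j + 1]) return false;  // lane boundary
  }
  return true;
}

// step2: combsort over whole vectors, leaving transposed sorted order.
inline void sortStep2(std::vector<Vec4> &va) {
  const size_t N = va.size();
  if (N == 0) return;
  for (size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {
    for (size_t i = 0; i + gap < N; i++) vectorCmpswap(va[i], va[i + gap]);
    // Partners past the end wrap around to the next lane of the first vectors.
    for (size_t i = N - gap; i < N; i++) vectorCmpswapSkew(va[i], va[i + gap - N]);
  }
  // gap == 1: repeat until sorted (this sketch has no maxLoop limit).
  while (!isSortedTransposed(va)) {
    for (size_t i = 0; i + 1 < N; i++) vectorCmpswap(va[i], va[i + 1]);
    vectorCmpswapSkew(va[N - 1], va[0]);
  }
}
```

Reading the result lane by lane (lane 0 of every vector, then lane 1, and so on) yields the sorted sequence.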




AA-sort algorithm
 • sort each block
   - O(n log n)
 • merge the sorted blocks
   - O(n) per merge pass




merge two sorted vector
   - a = [a3:a2:a1:a0], b = [b3:b2:b1:b0] are sorted
   - c = [b:a] = merge and sort (a, b)

       a = (a0, a1, a2, a3)   sorted
       b = (b0, b1, b2, b3)   sorted

       [b:a] = vector_merge(a,b)

       after the call the eight elements of [b:a] are sorted as a whole



data flow of merge
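   [Figure: data flow of the merge network. The inputs are the sorted vectors
    (a0, a1, a2, a3) and (b0, b1, b2, b3). The first stage compares a_i with b_i
    and produces min00/max00, min11/max11, min22/max22, min33/max33. It is
    followed by a stage with two compare-and-swap operations and a stage
    with three compare-and-swap operations.]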


source of vector_merge
 • Too complex
   - is there a better way?

   void vector_merge(V128& a, V128& b) {
     V128 m = pminud(a, b);
     V128 M = pmaxud(a, b);
     V128 s0 = punpckhqdq(m, m);
     V128 s1 = pminud(s0, M);
     V128 s2 = pmaxud(s0, M);
     V128 s3 = punpcklqdq(s1, punpckhqdq(M, M));
     V128 s4 = punpcklqdq(s2, m);
     s4 = pshufd<PACK(2, 1, 0, 3)>(s4);
     V128 s5 = pminud(s3, s4);
     V128 s6 = pmaxud(s3, s4);
     V128 s7 = pinsrd<2>(s5, movd(s6));
     V128 s8 = pinsrd<0>(s6, pextrd<2>(s5));
     a = pshufd<PACK(1, 2, 0, 3)>(s7);
     b = pshufd<PACK(3, 2, 0, 1)>(s8);
   }
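Because the shuffle sequence is hard to follow, a scalar reference is handy when testing it. The sketch below only states what vector_merge is expected to compute. It is not a translation of the SIMD code, and `Vec4` and `vectorMergeRef` are names made up for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stdint.h>

using Vec4 = std::array<uint32_t, 4>;  // stand-in for V128

// Scalar reference for vector_merge: a and b must each be sorted on entry.
// On return a holds the four smallest and b the four largest of the eight
// elements, and both are sorted.
inline void vectorMergeRef(Vec4 &a, Vec4 &b) {
  assert(std::is_sorted(a.begin(), a.end()));
  assert(std::is_sorted(b.begin(), b.end()));
  std::array<uint32_t, 8> c;
  std::merge(a.begin(), a.end(), b.begin(), b.end(), c.begin());
  std::copy(c.begin(), c.begin() + 4, a.begin());
  std::copy(c.begin() + 4, c.end(), b.begin());
}
```

Feeding the same random sorted inputs to both versions and comparing the outputs is one way to validate the SIMD implementation.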
std::merge()
 • merge [begin1, end1) and [begin2, end2)
 template <class In1, class In2, class Out>
 Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out)
 {
   // simplified: both input ranges are assumed to be non-empty
   for (;;) {
     *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++;
     if (begin1 == end1) return std::copy(begin2, end2, out);
     if (begin2 == end2) return std::copy(begin1, end1, out);
   }
 }




vectorised merge
 • merge arrays with vector_merge()
 void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN){
   uint32_t aPos = 0, bPos = 0, outPos = 0;
   V128 vMin = va[aPos++];
   V128 vMax = vb[bPos++];
   for (;;) {
     vector_merge(vMin, vMax);
     vo[outPos++] = vMin;
     if (aPos < aN) {
       if (bPos < bN) {
         V128 ta = va[aPos];
         V128 tb = vb[bPos];
         // compare ta0 with tb0
         if (movd(ta) <= movd(tb)) {
           vMin = ta;
           aPos++;
         } else {
           vMin = tb;
           bPos++;
         }
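The listing stops where the slide ends. As a guess at how the complete loop could be structured, here is a portable scalar model: `Vec4` stands in for V128, `vectorMergeRef` stands in for vector_merge, and the handling of an exhausted input and of the final vector is an assumption rather than the author's code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

using Vec4 = std::array<uint32_t, 4>;  // stand-in for V128

// Scalar stand-in for vector_merge: a gets the four smallest elements,
// b the four largest; both inputs must be sorted.
inline void vectorMergeRef(Vec4 &a, Vec4 &b) {
  std::array<uint32_t, 8> c;
  std::merge(a.begin(), a.end(), b.begin(), b.end(), c.begin());
  std::copy(c.begin(), c.begin() + 4, a.begin());
  std::copy(c.begin() + 4, c.end(), b.begin());
}

// Merge two sorted, non-empty runs of vectors into one sorted run.
inline std::vector<Vec4> mergeRuns(const std::vector<Vec4> &va,
                                   const std::vector<Vec4> &vb) {
  assert(!va.empty() && !vb.empty());
  std::vector<Vec4> vo;
  vo.reserve(va.size() + vb.size());
  size_t aPos = 0, bPos = 0;
  Vec4 vMin = va[aPos++];
  Vec4 vMax = vb[bPos++];
  for (;;) {
    vectorMergeRef(vMin, vMax);  // vMin: next output, vMax: carried over
    vo.push_back(vMin);
    if (aPos < va.size() && bPos < vb.size()) {
      // Load from the run whose next vector starts with the smaller element.
      if (va[aPos][0] <= vb[bPos][0]) vMin = va[aPos++];
      else vMin = vb[bPos++];
    } else if (aPos < va.size()) {
      vMin = va[aPos++];
    } else if (bPos < vb.size()) {
      vMin = vb[bPos++];
    } else {
      vo.push_back(vMax);  // nothing left to load: flush the carried vector
      return vo;
    }
  }
}
```

Each iteration emits the four smallest pending elements and keeps the four largest in vMax, so the only data-dependent branch per vector is the choice of which run to load from next.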

block size and rate of sort
 • What is a good block size for the vectorised sort?
   - half the size of L2 is recommended for PowerPC 970MP
     - L2 = 1MiB => 512KiB => block size = 128Ki elements of uint32_t
 • BS = 32Ki seems good for Xeon, Core i7
 • profile of sort and merge
   [Figure: chart of the share of time (%, 0 to 100) spent in sort and in merge]
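The block-size arithmetic above can be written down as a compile-time check. `blockElems` is a hypothetical helper for this example and simply encodes the "half of L2" rule.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

// Number of uint32_t elements that fit in half of an L2 cache of l2Bytes bytes.
constexpr size_t blockElems(size_t l2Bytes) {
  return l2Bytes / 2 / sizeof(uint32_t);
}

// L2 = 1 MiB => 512 KiB => 128Ki elements, as on the slide.
static_assert(blockElems(1024 * 1024) == 128 * 1024, "1 MiB L2 gives 128Ki elements");
// BS = 32Ki elements of uint32_t occupy 128 KiB.
static_assert(32 * 1024 * sizeof(uint32_t) == 128 * 1024, "32Ki elements take 128 KiB");
```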




Benchmark(1/3)
 • AA-sort vs std::sort for random data
   - Xeon X5650 + gcc-4.6.3
      - 4 times faster for fewer than 64Ki elements, 2.85 times faster for 4Mi elements
   [Figure: clock cycles (log scale, 1 to 10000000) versus number of uint32_t
    elements (16 to 4Mi) for std::sort and AA-sort]


Benchmark(2/3)
 • sort 64Ki uint on Xeon + gcc-4.6.3
   - AA-sort speed does not strongly depend on the input pattern
   [Figure: chart comparing std::sort and AA-sort, vertical axis 0 to 25000]




Benchmark(3/3)
 • sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11
   [Figure: chart comparing std::sort(gcc), AA-sort(gcc), std::sort(VC) and
    AA-sort(VC), vertical axis 0 to 16000]




A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 

Kürzlich hochgeladen (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 

AA-sort with SSE4.1

  • 1. AA-sort with SSE4.1
      Cybozu Labs
      2012/6/16 MITSUNARI Shigeo (@herumi)
      x86/x64 optimization seminar 4 (#x86opti)
  • 2. Agenda
      - Introduction of AA-sort
          - classic combsort
          - vectorized combsort
          - vectorized merge
      - benchmark
  • 3. AA-sort
      - Aligned-Access sort
          - proposed by Hiroshi Inoue et al. in "A high-performance sorting algorithm for multicore single-instruction multiple-data processors," 2011
              http://www.research.ibm.com/trl/people/inouehrs/SPE_SIMDsort.htm
              http://www.research.ibm.com/trl/people/inouehrs/pact2007.htm
          - For SIMD: fewer conditional branches, no unaligned data accesses
          - For multicore processors: they implemented it for PowerPC and Cell BE
          - O(n log n) complexity
      - I tried it on Intel CPUs (not complete)
          - https://github.com/herumi/opti/blob/master/intsort.hpp
          - the current version uses only one processor
  • 4. AA-sort
      - vectorized combsort for each block (<= L2 cache?)
      - vectorized merge of the sorted blocks
      [Diagram: the input array is split into block 0, block 1, block 2, block 3, ...; each block is sorted, then neighbouring sorted blocks are merged pairwise, level by level.]
  • 5. AA-sort algorithm
      - sort each block: O(n log n)
      - merge the sorted blocks: O(n)
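The two phases on the slide above can be sketched in scalar code. This is only an illustration of the block structure, not the author's implementation: `std::sort` and `std::merge` stand in for the vectorized combsort and the vectorized merge, and the function name `blockSortThenMerge` and its `blockSize` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar stand-in for the two AA-sort phases (no SIMD here).
// blockSortThenMerge and blockSize are hypothetical names for this sketch.
void blockSortThenMerge(std::vector<uint32_t>& a, size_t blockSize)
{
    const size_t n = a.size();
    if (n < 2 || blockSize == 0) return;
    // Phase 1: sort each block independently.
    for (size_t begin = 0; begin < n; begin += blockSize) {
        const size_t end = std::min(begin + blockSize, n);
        std::sort(a.begin() + begin, a.begin() + end);
    }
    // Phase 2: merge neighbouring sorted runs; the run width doubles per pass.
    std::vector<uint32_t> tmp(n);
    for (size_t width = blockSize; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            const size_t mid = std::min(lo + width, n);
            const size_t hi = std::min(lo + 2 * width, n);
            std::merge(a.begin() + lo, a.begin() + mid,
                       a.begin() + mid, a.begin() + hi,
                       tmp.begin() + lo);
        }
        a.swap(tmp);  // the merged pass becomes the input of the next pass
    }
}
```

Each merge pass touches every element once, and the number of passes grows with the logarithm of the number of blocks.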
  • 6. classic combsort (1/2)
      - improved bubble sort
          - unstable
          - O(n log n)
          - compare two elements separated by a gap (>= 1)
              the gap is divided by a shrink factor (about 1.3)
        size_t nextGap(size_t N)
        {
            return (N * 10) / 13;
        }
        void combsort(uint32_t *a, size_t N)
        {
            size_t gap = nextGap(N);
            while (gap > 1) {
                for (size_t i = 0; i < N - gap; i++) {
                    if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
                }
                gap = nextGap(gap);
            }
            …
  • 7. classic combsort (2/2)
      - gap = 1 means bubble sort
      - loop until the array is fully sorted
            …
            for (;;) {
                bool isSwapped = false;
                for (size_t i = 0; i < N - 1; i++) {
                    if (a[i] > a[i + 1]) {
                        std::swap(a[i], a[i + 1]);
                        isSwapped = true;
                    }
                }
                if (!isSwapped) return;
            }
        }
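Joining the two halves of the listing gives a complete function. The version below is a sketch of that combination with one addition that is not on the slides: a guard for `N < 2`, because `N - 1` wraps around for an empty array when `N` is a `size_t`.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

size_t nextGap(size_t n)
{
    return (n * 10) / 13;  // shrink factor of 1.3
}

void combsort(uint32_t *a, size_t N)
{
    if (N < 2) return;  // added guard (assumption): avoids unsigned wrap of N - 1
    // Phase 1: compare-and-swap pairs that are `gap` apart, shrinking the gap.
    for (size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {
        for (size_t i = 0; i + gap < N; i++) {
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        }
    }
    // Phase 2: gap == 1 is plain bubble sort; repeat until a pass swaps nothing.
    for (;;) {
        bool isSwapped = false;
        for (size_t i = 0; i + 1 < N; i++) {
            if (a[i] > a[i + 1]) {
                std::swap(a[i], a[i + 1]);
                isSwapped = true;
            }
        }
        if (!isSwapped) return;
    }
}
```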
  • 8. gap function
      - Combsort11
          - a final gap sequence of [11, 8, 6, 4, 3, 2, 1] seems good, according to
            http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm
        size_t nextGap(size_t n)
        {
            n = (n * 10) / 13;
            if (n == 9 || n == 10) return 11; // (*)
            return n;
        }
      - slightly faster when line (*) is added
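To see what line (*) changes, the gap sequence can be listed for a given size. The helper `gapSequence` and its `combsort11` flag are hypothetical names introduced only for this sketch; the gap rule itself is the one on the slide.

```cpp
#include <cstddef>
#include <vector>

// Gap rule from the slide; the flag switches line (*) on or off.
size_t nextGap(size_t n, bool combsort11)
{
    n = (n * 10) / 13;
    if (combsort11 && (n == 9 || n == 10)) return 11;  // (*)
    return n;
}

// Hypothetical helper: every gap used for an array of N elements,
// ending with the final gap of 1 (the bubble sort phase).
std::vector<size_t> gapSequence(size_t N, bool combsort11)
{
    std::vector<size_t> gaps;
    for (size_t gap = nextGap(N, combsort11); gap > 1; gap = nextGap(gap, combsort11)) {
        gaps.push_back(gap);
    }
    gaps.push_back(1);
    return gaps;
}
```

For N = 100 the sequence with the rule ends in 14, 11, 8, 6, 4, 3, 2, 1, while without it the tail is 14, 10, 7, 5, 3, 2, 1.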
  • 9. vectorized combsort
      - step1 : sort the values within each vector (32 bit x 4)
      - step2 : SIMD version of combsort
      - step3 : reorder the data
      [Diagram: the input 6 8 9 3 5 7 12 14 0 4 1 20 11 ... is loaded into vectors v0, v1, v2, v3, ... with lanes +0 to +3. step1 sorts inside each vector. After step2 each lane, read across the vectors, is sorted and lane +0 precedes lane +1, and so on (0 1 3 … 101 / 102 104 105 … 380 / 389 391 392 … 502 / 511 515 612 … 973). step3 reorders this into the linear sequence 0 1 3 … 101 102 104 105 … 380 389 391 392 …]
  • 10. step1
      - step1.1 : sort [v[i][j] | i<-[0..3]] for j = 0, 1, 2, 3
      - step1.2 : transpose

              input           after step1.1      after step1.2
                                 (sort)           (transpose)
            v0 v1 v2 v3       v0 v1 v2 v3        v0 v1 v2 v3
        +0   3  5  0  8        0  3  5  8         0  1  4  9
        +1   2  7  1  2        1  2  2  7         3  2  8 14
        +2   8 12  4 13        4  8 12 13         5  2 12 15
        +3   9 14 20 15        9 14 15 20         8  7 13 20
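  • 11. sort of 4 items
      - use pmaxud, pminud for uint32_t x 4
      - a comparator maps (a, b) to (min(a,b), max(a,b))
      [Diagram: sorting network for v0, v1, v2, v3.
        stage 1: min01, max01 from (v0, v1); min23, max23 from (v2, v3)
        stage 2: min0123 = min(min01, min23); max0123 = max(max01, max23);
                 s = max(min01, min23); t = min(max01, max23)
        stage 3: min(s,t), max(s,t)
        sorted output: min0123, min(s,t), max(s,t), max0123]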
  • 12. source of step1.1
      - V128 is a vector type holding four 32-bit integers
      - pminud(a, b) : min(a_i, b_i) for i = 0, 1, 2, 3
        void sort_step1_vec(V128 x[4])
        {
            V128 min01 = pminud(x[0], x[1]);
            V128 max01 = pmaxud(x[0], x[1]);
            V128 min23 = pminud(x[2], x[3]);
            V128 max23 = pmaxud(x[2], x[3]);
            x[0] = pminud(min01, min23);
            x[3] = pmaxud(max01, max23);
            V128 s = pmaxud(min01, min23);
            V128 t = pminud(max01, max23);
            x[1] = pminud(s, t);
            x[2] = pmaxud(s, t);
        }
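The data flow of `sort_step1_vec` can be checked without SSE4.1 hardware by modelling each register as four scalar lanes. In the sketch below, `V128` is an assumed `std::array` stand-in and `pminud` / `pmaxud` are plain lane-wise loops, not the intrinsics wrappers used in the talk.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using V128 = std::array<uint32_t, 4>;  // scalar model of one 128-bit register

V128 pminud(const V128& a, const V128& b)  // lane-wise unsigned minimum
{
    V128 r;
    for (size_t k = 0; k < 4; k++) r[k] = std::min(a[k], b[k]);
    return r;
}

V128 pmaxud(const V128& a, const V128& b)  // lane-wise unsigned maximum
{
    V128 r;
    for (size_t k = 0; k < 4; k++) r[k] = std::max(a[k], b[k]);
    return r;
}

// Same network as on the slide: for every lane k, x[0][k] <= x[1][k] <= x[2][k] <= x[3][k].
void sort_step1_vec(V128 x[4])
{
    V128 min01 = pminud(x[0], x[1]);
    V128 max01 = pmaxud(x[0], x[1]);
    V128 min23 = pminud(x[2], x[3]);
    V128 max23 = pmaxud(x[2], x[3]);
    x[0] = pminud(min01, min23);    // overall minimum
    x[3] = pmaxud(max01, max23);    // overall maximum
    V128 s = pmaxud(min01, min23);  // the two middle candidates
    V128 t = pminud(max01, max23);
    x[1] = pminud(s, t);
    x[2] = pmaxud(s, t);
}
```

Fed with the input matrix of the step1 slide, this yields the "after step1.1" matrix shown there.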
  • 13. transpose of 4x4 matrix
      - use unpcklps and unpckhps

              input           after pass 1       after pass 2
            x0 x1 x2 x3       t0 t1 t2 t3        x0 x1 x2 x3
        +0   3  5  0  8        3  5  8 12         3  2  8  9
        +1   2  7  1  2        0  8  4 13         5  7 12 14
        +2   8 12  4 13        2  7  9 14         0  1  4 20
        +3   9 14 20 15        1  2 20 15         8  2 13 15

        pass 1: t0=unpcklps(x0,x2)  t1=unpcklps(x1,x3)  t2=unpckhps(x0,x2)  t3=unpckhps(x1,x3)
        pass 2: x0=unpcklps(t0,t1)  x1=unpckhps(t0,t1)  x2=unpcklps(t2,t3)  x3=unpckhps(t2,t3)
  • 14. source of transpose and step1
        void transpose(V128 x[4])
        {
            V128 x0 = x[0];
            V128 x1 = x[1];
            V128 x2 = x[2];
            V128 x3 = x[3];
            V128 t0 = unpcklps(x0, x2);
            V128 t1 = unpcklps(x1, x3);
            V128 t2 = unpckhps(x0, x2);
            V128 t3 = unpckhps(x1, x3);
            x[0] = unpcklps(t0, t1);
            x[1] = unpckhps(t0, t1);
            x[2] = unpcklps(t2, t3);
            x[3] = unpckhps(t2, t3);
        }

        void sort_step1(V128 *va, size_t N)
        {
            for (size_t i = 0; i < N; i += 4) {
                sort_step1_vec(&va[i]);
                transpose(&va[i]);
            }
        }
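The two unpack passes can be traced with a scalar model as well. Here `V128` is again an assumed `std::array` stand-in, and `unpcklps` / `unpckhps` are written out as the lane interleavings those instructions perform: the low (or high) two lanes of both operands, alternating.

```cpp
#include <array>
#include <cstdint>

using V128 = std::array<uint32_t, 4>;  // scalar model of one 128-bit register

// Interleave the low two lanes: (a0, b0, a1, b1).
V128 unpcklps(const V128& a, const V128& b)
{
    return V128{a[0], b[0], a[1], b[1]};
}

// Interleave the high two lanes: (a2, b2, a3, b3).
V128 unpckhps(const V128& a, const V128& b)
{
    return V128{a[2], b[2], a[3], b[3]};
}

// 4x4 transpose in two passes of unpack operations, as on the slide.
void transpose(V128 x[4])
{
    V128 t0 = unpcklps(x[0], x[2]);
    V128 t1 = unpcklps(x[1], x[3]);
    V128 t2 = unpckhps(x[0], x[2]);
    V128 t3 = unpckhps(x[1], x[3]);
    x[0] = unpcklps(t0, t1);
    x[1] = unpckhps(t0, t1);
    x[2] = unpcklps(t2, t3);
    x[3] = unpckhps(t2, t3);
}
```

Applied to the output of step1.1, the transpose leaves every vector sorted internally, which is the goal of step1.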
  • 15. SIMD version of combsort
      - the first half of the code uses
          - vector_cmpswap
          - vector_cmpswap_skew
        bool sort_step2(V128 *va, size_t N)
        {
            size_t gap = nextGap(N);
            while (gap > 1) {
                for (size_t i = 0; i < N - gap; i++) {
                    vector_cmpswap(va[i], va[i + gap]);
                }
                for (size_t i = N - gap; i < N; i++) {
                    vector_cmpswap_skew(va[i], va[i + gap - N]);
                }
                gap = nextGap(gap);
            }
            ...
  • 16. vector_cmpswap
      - no conditional branch
      - the scalar compare-and-swap
            if (a[i] > a[i + gap]) std::swap(a[i], a[i + gap]);
        is vectorised as (a, b) -> (min(a,b), max(a,b)):
            void vector_cmpswap(V128& a, V128& b)
            {
                V128 t = pmaxud(a, b);
                a = pminud(a, b);
                b = t;
            }
  • 17. vector_cmpswap_skew
      - for the boundary of the array
      - (a', b') = vector_cmpswap_skew(a, b)
            a  = [ a3         : a2         : a1         : a0         ]
            b  = [ b3         : b2         : b1         : b0         ]
            a' = [ a3         : min(a2,b3) : min(a1,b2) : min(a0,b1) ]
            b' = [ max(a2,b3) : max(a1,b2) : max(a0,b1) : b0         ]
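The slides give no source for `vector_cmpswap_skew`, so the following is only a scalar model of the data flow in the table above, with `V128` as an assumed `std::array` stand-in. The shift by one lane follows from the layout the combsort works towards: element `k * N + i` of the sorted order lives in lane `k` of `va[i]`, so a partner that is `gap` further on and wraps past the end of the vector array sits in lane `k + 1` of `va[i + gap - N]`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

using V128 = std::array<uint32_t, 4>;  // scalar model of one 128-bit register

// Lane-wise compare-and-swap: a keeps the minima, b the maxima.
void vector_cmpswap(V128& a, V128& b)
{
    for (size_t k = 0; k < 4; k++) {
        if (a[k] > b[k]) std::swap(a[k], b[k]);
    }
}

// Skewed variant (sketch): lane k of a is paired with lane k + 1 of b.
// Lane 3 of a and lane 0 of b have no partner and stay untouched.
void vector_cmpswap_skew(V128& a, V128& b)
{
    for (size_t k = 0; k < 3; k++) {
        if (a[k] > b[k + 1]) std::swap(a[k], b[k + 1]);
    }
}
```

A real SSE version would presumably shift one operand by a lane before the min/max pair and blend the untouched lanes back; that part is not shown on the slides and is left out here.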
  • 18. isSortedVec
      - check whether the array is sorted
      - ptest_zf(a, b) is true if (a & b) == 0
      - a <= b  <=>  max(a,b) == b  <=>  c := max(a,b) - b == 0
      - pcmpgtd is for int32_t, so we can't use it
        bool isSortedVec(const V128 *va, size_t N)
        {
            for (size_t i = 0; i < N - 1; i++) {
                V128 a = va[i];
                V128 b = va[i + 1];
                V128 c = pmaxud(a, b);
                c = psubd(c, b);
                if (!ptest_zf(c, c)) {
                    return false;
                }
            }
            return true;
        }
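A one-lane scalar sketch shows why a signed compare is unsuitable for `uint32_t` and why the max-and-subtract test works. The function names are hypothetical. The third function is a commonly used alternative that is not on the slides: flipping the top bit of both operands turns an unsigned comparison into a signed one. The casts to `int32_t` assume the usual two's-complement wrap, which is guaranteed from C++20 and is what mainstream compilers do before that.

```cpp
#include <algorithm>
#include <cstdint>

// Test from the slide, for one lane: a <= b  <=>  max(a, b) - b == 0.
bool lessEqualViaMaxSub(uint32_t a, uint32_t b)
{
    return (std::max(a, b) - b) == 0;
}

// What a signed compare such as pcmpgtd computes for one lane:
// the bit patterns are interpreted as int32_t.
bool signedGreater(uint32_t a, uint32_t b)
{
    return static_cast<int32_t>(a) > static_cast<int32_t>(b);
}

// Alternative (assumption, not from the slides): bias both operands by
// 0x80000000 so that the signed compare orders them like unsigned values.
bool biasedGreater(uint32_t a, uint32_t b)
{
    return static_cast<int32_t>(a ^ 0x80000000u) > static_cast<int32_t>(b ^ 0x80000000u);
}
```

For a = 0x80000000 and b = 1 the signed compare reports a < b although a is the larger unsigned value, whereas the other two tests give the unsigned answer.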
  • 19. loop for gap == 1
      - vectorised bubble sort for gap == 1
      - give up if the loop count reaches maxLoop and fall back to std::sort
          - this rarely happens
        const int maxLoop = 10;
        for (int loop = 0; loop < maxLoop; loop++) {
            for (size_t i = 0; i < N - 1; i++) {
                vector_cmpswap(va[i], va[i + 1]);
            }
            vector_cmpswap_skew(va[N - 1], va[0]);
            if (isSortedVec(va, N)) return true;
        }
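Putting steps 1 to 3 together, a scalar model of the whole in-block sort might look as follows. Several parts are assumptions rather than the author's code: `V128` is a `std::array` stand-in, step1 uses `std::sort` on each vector instead of the min/max network and transpose, step3 is written as a plain index remapping because the slides do not show it, and the name `inCoreSort` is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using V128 = std::array<uint32_t, 4>;  // scalar model of one 128-bit register

size_t nextGap(size_t n)
{
    n = (n * 10) / 13;
    return (n == 9 || n == 10) ? 11 : n;
}

void vector_cmpswap(V128& a, V128& b)
{
    for (size_t k = 0; k < 4; k++) if (a[k] > b[k]) std::swap(a[k], b[k]);
}

void vector_cmpswap_skew(V128& a, V128& b)
{
    for (size_t k = 0; k < 3; k++) if (a[k] > b[k + 1]) std::swap(a[k], b[k + 1]);
}

bool isSortedVec(const std::vector<V128>& va)
{
    for (size_t i = 0; i + 1 < va.size(); i++)
        for (size_t k = 0; k < 4; k++)
            if (va[i][k] > va[i + 1][k]) return false;
    return true;
}

void inCoreSort(std::vector<uint32_t>& data)
{
    const size_t N = data.size() / 4;  // number of vectors
    if (N < 2 || data.size() % 4 != 0) {  // sketch only handles whole vectors
        std::sort(data.begin(), data.end());
        return;
    }
    std::vector<V128> va(N);
    for (size_t i = 0; i < N; i++) {  // step1 stand-in: sort inside each vector
        for (size_t k = 0; k < 4; k++) va[i][k] = data[i * 4 + k];
        std::sort(va[i].begin(), va[i].end());
    }
    for (size_t gap = nextGap(N); gap > 1; gap = nextGap(gap)) {  // step2
        for (size_t i = 0; i + gap < N; i++) vector_cmpswap(va[i], va[i + gap]);
        for (size_t i = N - gap; i < N; i++) vector_cmpswap_skew(va[i], va[i + gap - N]);
    }
    bool sorted = false;
    const int maxLoop = 10;
    for (int loop = 0; loop < maxLoop && !sorted; loop++) {  // gap == 1 passes
        for (size_t i = 0; i + 1 < N; i++) vector_cmpswap(va[i], va[i + 1]);
        vector_cmpswap_skew(va[N - 1], va[0]);
        sorted = isSortedVec(va);
    }
    for (size_t i = 0; i < N; i++)  // step3: lane k of va[i] is element k * N + i
        for (size_t k = 0; k < 4; k++) data[k * N + i] = va[i][k];
    if (!sorted) std::sort(data.begin(), data.end());  // fallback after maxLoop
}
```

Checking `isSortedVec` directly after the skewed compare-and-swap covers both orderings at once: the skew has just put the lane boundaries in order, and the check confirms that it did not disturb the order inside the lanes.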
  • 20. AA-sort algorithm
      - sort each block: O(n log n)
      - merge the sorted blocks: O(n)
  • 21. merge two sorted vectors
      - a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
      - c = [b:a] = merge and sort (a, b)
      [Diagram: the sorted vectors a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3) go into [b:a] = vector_merge(a, b); the output, read across both vectors, is sorted.]
  • 22. data flow of merge
      [Diagram: comparator network for the sorted inputs a0 a1 a2 a3 and b0 b1 b2 b3. The first stage compares a_i with b_i and yields min00/max00, min11/max11, min22/max22, min33/max33; further comparator stages follow.]
  • 23. source of vector_merge
      - Too complex
      - Any better idea?
        void vector_merge(V128& a, V128& b)
        {
            V128 m = pminud(a, b);
            V128 M = pmaxud(a, b);
            V128 s0 = punpckhqdq(m, m);
            V128 s1 = pminud(s0, M);
            V128 s2 = pmaxud(s0, M);
            V128 s3 = punpcklqdq(s1, punpckhqdq(M, M));
            V128 s4 = punpcklqdq(s2, m);
            s4 = pshufd<PACK(2, 1, 0, 3)>(s4);
            V128 s5 = pminud(s3, s4);
            V128 s6 = pmaxud(s3, s4);
            V128 s7 = pinsrd<2>(s5, movd(s6));
            V128 s8 = pinsrd<0>(s6, pextrd<2>(s5));
            a = pshufd<PACK(1, 2, 0, 3)>(s7);
            b = pshufd<PACK(3, 2, 0, 1)>(s8);
        }
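For comparison, here is a scalar model of a different, well-known network with the same contract (afterwards `a` holds the four smallest values and `b` the four largest, both sorted): a bitonic merge. It is not the shuffle sequence from the slide, `V128` is an assumed `std::array` stand-in, and `bitonicSort4` and `vector_merge_bitonic` are hypothetical names. Whether it maps to fewer SSE instructions than the listing above is not claimed here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

using V128 = std::array<uint32_t, 4>;  // scalar model of one 128-bit register

// Sorts a 4-lane bitonic sequence with two comparator stages.
void bitonicSort4(V128& v)
{
    if (v[0] > v[2]) std::swap(v[0], v[2]);  // distance 2
    if (v[1] > v[3]) std::swap(v[1], v[3]);
    if (v[0] > v[1]) std::swap(v[0], v[1]);  // distance 1
    if (v[2] > v[3]) std::swap(v[2], v[3]);
}

// Bitonic merge of two sorted vectors (a swapped-in technique, see lead-in).
void vector_merge_bitonic(V128& a, V128& b)
{
    // a followed by reversed b is bitonic; a lane-wise min/max against the
    // reversed b splits the eight values into the low four and the high four.
    V128 lo, hi;
    for (size_t k = 0; k < 4; k++) {
        const uint32_t x = a[k];
        const uint32_t y = b[3 - k];
        lo[k] = x < y ? x : y;
        hi[k] = x < y ? y : x;
    }
    bitonicSort4(lo);  // each half is itself bitonic
    bitonicSort4(hi);
    a = lo;
    b = hi;
}
```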
  • 24. std::merge()
      - merge [begin1, end1) and [begin2, end2)
        template <class In1, class In2, class Out>
        Out merge(In1 begin1, In1 end1, In2 begin2, In2 end2, Out out)
        {
            for (;;) {
                *out++ = *begin2 < *begin1 ? *begin2++ : *begin1++;
                if (begin1 == end1) return copy(begin2, end2, out);
                if (begin2 == end2) return copy(begin1, end1, out);
            }
        }
  • 25. vectorised merge
      - merge arrays with vector_merge()
        void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN)
        {
            uint32_t aPos = 0, bPos = 0, outPos = 0;
            V128 vMin = va[aPos++];
            V128 vMax = vb[bPos++];
            for (;;) {
                vector_merge(vMin, vMax);
                vo[outPos++] = vMin;
                if (aPos < aN) {
                    if (bPos < bN) {
                        V128 ta = va[aPos];
                        V128 tb = vb[bPos];
                        // compare ta0 with tb0
                        if (movd(ta) <= movd(tb)) {
                            vMin = ta;
                            aPos++;
                        } else {
                            vMin = tb;
                            bPos++;
                        }
                        ...
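The listing on the slide stops partway through the loop. One way to complete it is sketched below in a scalar model; the branches for an exhausted input and the final flush of `vMax` are assumptions, not the author's code. `V128` is an assumed `std::array` stand-in, and `vector_merge` is replaced by a reference version built on `std::merge`. Both inputs must hold at least one vector.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using V128 = std::array<uint32_t, 4>;  // scalar model of one 128-bit register

// Reference stand-in for vector_merge: afterwards a holds the four smallest
// values and b the four largest, each sorted. Both inputs must be sorted.
void vector_merge(V128& a, V128& b)
{
    std::array<uint32_t, 8> t;
    std::merge(a.begin(), a.end(), b.begin(), b.end(), t.begin());
    std::copy(t.begin(), t.begin() + 4, a.begin());
    std::copy(t.begin() + 4, t.end(), b.begin());
}

// Merges sorted va[0..aN) and vb[0..bN) into vo[0..aN+bN); needs aN, bN >= 1.
void merge(V128 *vo, const V128 *va, size_t aN, const V128 *vb, size_t bN)
{
    size_t aPos = 0, bPos = 0, outPos = 0;
    V128 vMin = va[aPos++];
    V128 vMax = vb[bPos++];
    for (;;) {
        vector_merge(vMin, vMax);
        vo[outPos++] = vMin;  // the low half is final
        if (aPos < aN && (bPos >= bN || va[aPos][0] <= vb[bPos][0])) {
            vMin = va[aPos++];  // a has the smaller head, or b is exhausted
        } else if (bPos < bN) {
            vMin = vb[bPos++];
        } else {
            vo[outPos++] = vMax;  // both inputs consumed: flush the high half
            return;
        }
    }
}
```

Choosing the next vector by its first lane works because every value still held in `vMax` was loaded earlier and is therefore no larger than the head of the input that is not chosen.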
  • 26. block size and rate of sort
      - What is a good block size for the vectorised sort?
          - half the L2 size is recommended for PowerPC 970MP
              L2 = 1MiB => 512KiB => block size = 128Ki uint32_t elements
          - BS = 32Ki seems good for Xeon, Core i7
      - profile of sort and merge
      [Chart: share of run time spent in sort (%) and merge (%), y axis 0 to 100]
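The block-size arithmetic for the PowerPC 970MP case is short enough to write out. The helper name `blockElems` is hypothetical; it only encodes the "half of L2, counted in uint32_t elements" rule quoted above.

```cpp
#include <cstddef>
#include <cstdint>

// Number of uint32_t elements that fit into half of an L2 cache of l2Bytes.
constexpr size_t blockElems(size_t l2Bytes)
{
    return l2Bytes / 2 / sizeof(uint32_t);
}

// 1 MiB of L2 => 512 KiB per block => 128Ki elements of 4 bytes each.
static_assert(blockElems(size_t(1) << 20) == size_t(128) * 1024,
              "1 MiB L2 gives a block of 128Ki uint32_t");
```

The BS = 32Ki found for Xeon and Core i7 is an empirical value from the talk and is not derived from this rule.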
  • 27. Benchmark (1/3)
      - AA-sort vs std::sort for random data
      - Xeon X5650 + gcc-4.6.3
          AA-sort is 4 times faster for fewer than 64Ki elements and 2.85 times faster for 4Mi elements
      [Chart: clock cycles (log scale, 1 to 10000000) against # of uint32_t (16, 64, 256, 1Ki, 4Ki, 16Ki, 64Ki, 256Ki, 1Mi, 4Mi) for std::sort and AA-sort; fewer cycles is faster]
  • 28. Benchmark (2/3)
      - sort 64Ki uint on Xeon + gcc-4.6.3
      - AA-sort speed does not depend strongly on the input pattern
      [Chart: std::sort and AA-sort for several input patterns, y axis 0 to 25000; lower is faster]
  • 29. Benchmark (3/3)
      - sort 64Ki uint on Core i7 + gcc-4.6.3 / VC11
      [Chart: std::sort(gcc), AA-sort(gcc), std::sort(VC) and AA-sort(VC), y axis 0 to 16000; lower is faster]