Wednesday, March 22, 2017

Lab 6 - Auto-Vectorization with gcc

This is an exercise on auto-vectorization. The purpose is to analyse how a simple piece of source code compiles, by reading its disassembly, and then to modify the source to help the compiler produce a more efficient binary.

The source was created by following these guidelines. The following functions live in the main file; the first adds pairs of array elements and stores the results in a third array, and the second adds all the elements of an array into a single value.


//adds pairs of array elements
void addArr(int* p3, int* p1, int* p2, int length){

    int i;
    
    for(i=0; i<length; i++)
        p3[i] = p1[i] + p2[i];
        
}

//adds elements of an int array
long addElems(int *arr , int length){

    int i;
    long sum = 0;

    for (i=0; i<length; i++)
        sum += arr[i];

    return sum;
}


Compiling the main file on an AArch64 system with no optimization options and disassembling the result (e.g. with objdump -d), we get the following for the first function:


0000000000400660 <addArr>:
  400660:       d100c3ff        sub     sp, sp, #0x30 --> Creating Stack
  400664:       f9000fe0        str     x0, [sp,#24] --> Saving p3 pointer
  400668:       f9000be1        str     x1, [sp,#16] --> Saving p1 pointer
  40066c:       f90007e2        str     x2, [sp,#8] --> Saving p2 pointer
  400670:       b90007e3        str     w3, [sp,#4] --> Saving length int
  400674:       b9002fff        str     wzr, [sp,#44] ---> assigning zero to i 
  400678:       14000014        b       4006c8 <addArr+0x68> ---> jump to the loop condition check
  40067c:       b9802fe0        ldrsw   x0, [sp,#44] ---> load i (sign-extended) into x0
  400680:       d37ef400        lsl     x0, x0, #2 --> multiply i by 4 (sizeof(int))
  400684:       f9400fe1        ldr     x1, [sp,#24] --> load the p3 pointer into x1
  400688:       8b000020        add     x0, x1, x0 --> offset the p3 pointer by i*4, save in x0
  40068c:       b9802fe1        ldrsw   x1, [sp,#44] --> load i into x1 again (x0 must be kept)
  400690:       d37ef421        lsl     x1, x1, #2 --> multiply i by 4 (again)
  400694:       f9400be2        ldr     x2, [sp,#16] --> load the p1 pointer into x2
  400698:       8b010041        add     x1, x2, x1 --> offset the p1 pointer by i*4, save in x1
  40069c:       b9400022        ldr     w2, [x1] --> load the p1 int value into w2
  4006a0:       b9802fe1        ldrsw   x1, [sp,#44] --> load i into x1 (yet again)
  4006a4:       d37ef421        lsl     x1, x1, #2 --> multiply i by 4 (again)
  4006a8:       f94007e3        ldr     x3, [sp,#8] --> load the p2 pointer into x3
  4006ac:       8b010061        add     x1, x3, x1 --> offset the p2 pointer by i*4, save in x1
  4006b0:       b9400021        ldr     w1, [x1] --> load the p2 int value into w1
  4006b4:       0b010041        add     w1, w2, w1 --> finally add the pair of elements, save in w1
  4006b8:       b9000001        str     w1, [x0] --> store into p3 array
  4006bc:       b9402fe0        ldr     w0, [sp,#44] \
  4006c0:       11000400        add     w0, w0, #0x1  --> Inc i
  4006c4:       b9002fe0        str     w0, [sp,#44] /
  4006c8:       b9402fe1        ldr     w1, [sp,#44] ---> load i into w1
  4006cc:       b94007e0        ldr     w0, [sp,#4] ---> load length into w0
  4006d0:       6b00003f        cmp     w1, w0  ---> compare i with length
  4006d4:       54fffd4b        b.lt    40067c <addArr+0x1c> ---> if i < length, loop again
  4006d8:       9100c3ff        add     sp, sp, #0x30 ---> restores stack pointer
  4006dc:       d65f03c0        ret   ---> return to caller

As we can see from the code, no vectorization is happening: the loop produces exactly one addition per iteration. We also notice that the program uses registers very conservatively. Every variable lives on the stack and is reloaded from memory on each use, and no general-purpose register beyond the argument registers x0-x3 is ever touched, because without optimization the compiler gives each C variable a memory home instead of caching it in a register. Going through memory like this is far slower than working in registers.

Now using the '-O1' compiler option:

0000000000400660 <addArr>:
  400660:       6b1f007f        cmp     w3, wzr  --> w3 holds 'length'
  400664:       5400018d        b.le    400694 <addArr+0x34> --> nothing to do if length <= 0
  400668:       51000466        sub     w6, w3, #0x1 --> length - 1
  40066c:       910004c6        add     x6, x6, #0x1 --> + 1, now as a 64-bit value
  400670:       d37ef4c6        lsl     x6, x6, #2 --> x6 = length * 4, the end byte offset
  400674:       d2800003        mov     x3, #0x0                        // #0 --> x3 = running byte offset
  400678:       b8636825        ldr     w5, [x1,x3] --> load p1[i]
  40067c:       b8636844        ldr     w4, [x2,x3] --> load p2[i]; x1 and x2 hold the p1 and p2 pointers
  400680:       0b0400a4        add     w4, w5, w4 --> add the pair
  400684:       b8236804        str     w4, [x0,x3] --> store into p3[i]; x0 holds the p3 pointer
  400688:       91001063        add     x3, x3, #0x4 --> advance the offset by one int
  40068c:       eb06007f        cmp     x3, x6 --> reached the end offset?
  400690:       54ffff41        b.ne    400678 <addArr+0x18> --> if not, loop
  400694:       d65f03c0        ret


We can see that the code now processes the results much more directly: it barely touches memory apart from the array elements themselves and allocates no stack frame at all. Still, the compiler won't vectorize the loop, because auto-vectorization is not enabled at -O1. To enable it, we can either use -O3, which bundles it together with various other flags, or simply add the single '-ftree-vectorize' flag when compiling (e.g. gcc -O1 -ftree-vectorize main.c, assuming the source is called main.c).
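
As an aside (not something this lab relies on), GCC can also turn the vectorizer on for a single function through its 'optimize' attribute, which is handy if you don't want to change the flags for the whole build. A minimal sketch, assuming at least -O1 is already in effect on the command line:

//GCC-specific: enable -ftree-vectorize for this one function only;
//the rest of the file keeps whatever -O level the command line sets
__attribute__((optimize("tree-vectorize")))
void addArr(int* p3, int* p1, int* p2, int length){

    int i;

    for(i=0; i<length; i++)
        p3[i] = p1[i] + p2[i];

}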

Let's try adding just that single flag on top of -O1:


0000000000400660 <addArr>:
  400660: 6b1f007f  cmp w3, wzr
  400664: 540007ad  b.le 400758 <addArr+0xf8>
  400668: 91004004  add x4, x0, #0x10 --> start of the run-time checks that p3, p1 and p2 do not overlap
  40066c: eb04003f  cmp x1, x4
  400670: 1a9f37e6  cset w6, cs
  400674: 91004025  add x5, x1, #0x10
  400678: eb05001f  cmp x0, x5
  40067c: 1a9f37e5  cset w5, cs
  400680: 2a0500c5  orr w5, w6, w5
  400684: eb04005f  cmp x2, x4
  400688: 1a9f37e6  cset w6, cs
  40068c: 91004044  add x4, x2, #0x10
  400690: eb04001f  cmp x0, x4
  400694: 1a9f37e4  cset w4, cs
  400698: 2a0400c4  orr w4, w6, w4
  40069c: 6a0400bf  tst w5, w4
  4006a0: 54000460  b.eq 40072c <addArr+0xcc>
  4006a4: 71000c7f  cmp w3, #0x3 --> fewer than four elements?
  4006a8: 54000429  b.ls 40072c <addArr+0xcc> --> if so, fall back to the scalar loop
  4006ac: 53027c66  lsr w6, w3, #2
  4006b0: 531e74c7  lsl w7, w6, #2
  4006b4: 34000387  cbz w7, 400724 <addArr+0xc4>
  4006b8: d2800004  mov x4, #0x0                    // #0
  4006bc: 2a0403e5  mov w5, w4
  4006c0: 8b040028  add x8, x1, x4 --> x8 = &p1[i]
  4006c4: 4c407901  ld1 {v1.4s}, [x8] --> load four ints from p1
  4006c8: 8b040048  add x8, x2, x4 --> x8 = &p2[i]
  4006cc: 4c407900  ld1 {v0.4s}, [x8] --> load four ints from p2
  4006d0: 4ea08420  add v0.4s, v1.4s, v0.4s --> Vectorization! Four additions at once
  4006d4: 8b040008  add x8, x0, x4 --> x8 = &p3[i]
  4006d8: 4c007900  st1 {v0.4s}, [x8] --> store four results into p3
  4006dc: 110004a5  add w5, w5, #0x1
  4006e0: 91004084  add x4, x4, #0x10
  4006e4: 6b0600bf  cmp w5, w6
  4006e8: 54fffec3  b.cc 4006c0 <addArr+0x60>
  4006ec: 1400000a  b 400714 <addArr+0xb4>
  4006f0: 937e7c85  sbfiz x5, x4, #2, #32 --> scalar loop: leftover elements, or everything if the checks failed
  4006f4: b8656827  ldr w7, [x1,x5]
  4006f8: b8656846  ldr w6, [x2,x5]
  4006fc: 0b0600e6  add w6, w7, w6
  400700: b8256806  str w6, [x0,x5]
  400704: 11000484  add w4, w4, #0x1
  400708: 6b04007f  cmp w3, w4
  40070c: 54ffff2c  b.gt 4006f0 <addArr+0x90>
  400710: 14000012  b 400758 <addArr+0xf8>
  400714: 2a0703e4  mov w4, w7
  400718: 6b07007f  cmp w3, w7
  40071c: 54fffea1  b.ne 4006f0 <addArr+0x90>
  400720: 1400000e  b 400758 <addArr+0xf8>
  400724: 52800004  mov w4, #0x0                    // #0
  400728: 17fffff2  b 4006f0 <addArr+0x90>
  40072c: 51000466  sub w6, w3, #0x1
  400730: 910004c6  add x6, x6, #0x1
  400734: d37ef4c6  lsl x6, x6, #2
  400738: d2800003  mov x3, #0x0                    // #0
  40073c: b8636825  ldr w5, [x1,x3]
  400740: b8636844  ldr w4, [x2,x3]
  400744: 0b0400a4  add w4, w5, w4
  400748: b8236804  str w4, [x0,x3]
  40074c: 91001063  add x3, x3, #0x4
  400750: eb06007f  cmp x3, x6
  400754: 54ffff41  b.ne 40073c <addArr+0xdc>
  400758: d65f03c0  ret


We finally get vectorized operations! The price is a long prologue of run-time checks: the compiler compares the three pointers to make sure the arrays cannot overlap within a single 16-byte vector, and checks that there are at least four elements to process, falling back to a scalar loop otherwise. That is why the function balloons to over 60 instructions.
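
To make those branches easier to follow, here is a rough C equivalent of the versioned loop the compiler generated (a sketch for illustration only, not the compiler's literal output; the name addArr_versioned is made up):

//conceptual C equivalent of the generated code
void addArr_versioned(int* p3, int* p1, int* p2, int length){

    int i = 0;

    //vectorize only if no array can overlap another within one 16-byte vector
    int p1_safe = (p1 + 4 <= p3) || (p3 + 4 <= p1);
    int p2_safe = (p2 + 4 <= p3) || (p3 + 4 <= p2);

    if(p1_safe && p2_safe && length >= 4){
        for(; i + 4 <= length; i += 4){
            //this group of four is what one ld1/ld1/add/st1 sequence handles
            p3[i]   = p1[i]   + p2[i];
            p3[i+1] = p1[i+1] + p2[i+1];
            p3[i+2] = p1[i+2] + p2[i+2];
            p3[i+3] = p1[i+3] + p2[i+3];
        }
    }

    //scalar loop: the leftover elements, or everything if the checks failed
    for(; i < length; i++)
        p3[i] = p1[i] + p2[i];
}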

Since compilers and memory allocators normally align data anyway, we can pretty much tell the compiler to assume that the arrays are, in fact, suitably aligned, and promise that they don't overlap. Following some techniques in this article, our function now looks like this:


void addArr(int *__restrict p3, int *__restrict p1, int *__restrict p2, int length){
    int i;
    int *arr3 = __builtin_assume_aligned(p3, 16);
    int *arr2 = __builtin_assume_aligned(p2, 16);
    int *arr1 = __builtin_assume_aligned(p1, 16);

    for(i=0; i<length; i++)
        arr3[i] = arr1[i] + arr2[i];
}

The '__builtin_assume_aligned' built-in tells the compiler that the pointers are at least 16-byte aligned, and the '__restrict' qualifier tells it that the pointers refer to regions of memory that do not overlap at runtime. With that information the compiler can drop the run-time overlap checks, which results in this shorter code:


0000000000400660 <addArr>:
  400660: 6b1f007f  cmp w3, wzr
  400664: 5400046d  b.le 4006f0 <addArr+0x90>
  400668: 53027c66  lsr w6, w3, #2 --> w6 = length / 4, the number of vector iterations
  40066c: 531e74c5  lsl w5, w6, #2 --> w5 = number of elements the vector loop will cover
  400670: 340003c5  cbz w5, 4006e8 <addArr+0x88> --> none? go straight to the scalar loop
  400674: 71000c7f  cmp w3, #0x3 --> fewer than four elements?
  400678: 54000389  b.ls 4006e8 <addArr+0x88> --> if so, scalar loop
  40067c: d2800004  mov x4, #0x0                    // #0
  400680: 2a0403e7  mov w7, w4
  400684: 8b040008  add x8, x0, x4 --> x8 = &p3[i]
  400688: 8b04002a  add x10, x1, x4 --> x10 = &p1[i]
  40068c: 8b040049  add x9, x2, x4 --> x9 = &p2[i]
  400690: 4c407941  ld1 {v1.4s}, [x10] --> load four ints from p1
  400694: 4c407920  ld1 {v0.4s}, [x9] --> load four ints from p2
  400698: 4ea08420  add v0.4s, v1.4s, v0.4s    ---> Still vectorizing!
  40069c: 4c007900  st1 {v0.4s}, [x8] --> store four results into p3
  4006a0: 110004e7  add w7, w7, #0x1
  4006a4: 91004084  add x4, x4, #0x10
  4006a8: 6b0700df  cmp w6, w7
  4006ac: 54fffec8  b.hi 400684 <addArr+0x24>
  4006b0: 1400000a  b 4006d8 <addArr+0x78>
  4006b4: 937e7c85  sbfiz x5, x4, #2, #32 --> scalar tail loop
  4006b8: b8656827  ldr w7, [x1,x5]
  4006bc: b8656846  ldr w6, [x2,x5]
  4006c0: 0b0600e6  add w6, w7, w6
  4006c4: b8256806  str w6, [x0,x5]
  4006c8: 11000484  add w4, w4, #0x1
  4006cc: 6b04007f  cmp w3, w4
  4006d0: 54ffff2c  b.gt 4006b4 <addArr+0x54>
  4006d4: 14000007  b 4006f0 <addArr+0x90>
  4006d8: 2a0503e4  mov w4, w5
  4006dc: 6b05007f  cmp w3, w5
  4006e0: 54fffea1  b.ne 4006b4 <addArr+0x54>
  4006e4: 14000003  b 4006f0 <addArr+0x90>
  4006e8: 52800004  mov w4, #0x0                    // #0
  4006ec: 17fffff2  b 4006b4 <addArr+0x54>
  4006f0: d65f03c0  ret

Applying the same techniques to the 'addElems' function:


long addElems(int *__restrict p_arr , int length){

    int i;
    long sum = 0;
    int *arr = __builtin_assume_aligned(p_arr, 16);

    for (i=0; i<length; i++)
        sum += arr[i];

    return sum;
}

And we also get vectorized and relatively short assembly for it.
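
One thing worth stressing: these hints are promises, not requests. If a caller ever passes pointers that are not really 16-byte aligned, or arrays that overlap, the behaviour is undefined. Here is a minimal sketch of a caller that keeps those promises, using C11's aligned_alloc (the array size and variable names are made up for illustration, and addArr/addElems are assumed to be defined earlier in the same file):

#include <stdio.h>
#include <stdlib.h>

#define N 1024  //multiple of 4, so the vector loop covers everything

int main(void){
    //aligned_alloc (C11) returns memory aligned to the requested boundary;
    //the total size must be a multiple of that alignment (16 divides N*4 here)
    int *p1 = aligned_alloc(16, N * sizeof(int));
    int *p2 = aligned_alloc(16, N * sizeof(int));
    int *p3 = aligned_alloc(16, N * sizeof(int));
    if(!p1 || !p2 || !p3)
        return 1;

    int i;
    for(i = 0; i < N; i++){
        p1[i] = i;
        p2[i] = 2 * i;
    }

    addArr(p3, p1, p2, N);                 //non-overlapping, 16-byte aligned
    printf("sum = %ld\n", addElems(p3, N));

    free(p1); free(p2); free(p3);
    return 0;
}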

Conclusion

As GCC supports many different processors and architectures, it has to generate code that is correct on all of them and for every legal input. This means many optimizations do not take place automatically, and it is left to the programmer to supply the extra information, through flags, alignment promises and restrict qualifiers, that makes them safe.

Vectorization can grant huge performance boosts, and there is a myriad of existing software that doesn't take advantage of it, sometimes opting for multi-threaded solutions instead. With that said, there is great untapped potential to improve such software with vectorization, and that might in turn push compilers and CPU manufacturers toward better auto-vectorization support.

