The source was created by following these guidelines. The following two functions live in the main file: the first adds pairs of array elements and stores the results in a third array, and the second adds all elements of an array into a single value.
//adds pairs of array elements
void addArr(int* p3, int* p1, int* p2, int length){
    int i;
    for(i=0; i<length; i++)
        p3[i] = p1[i] + p2[i];
}

//adds elements of an int array
long addElems(int *arr, int length){
    int i;
    long sum = 0;
    for (i=0; i<length; i++)
        sum += arr[i];
    return sum;
}
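For reference, a minimal driver along these lines can exercise both functions when added below them in the same file; the array names, size, and fill values here are just assumptions for illustration, not taken from the original main file:

#include <stdio.h>

#define LEN 1000    /* arbitrary length, assumed for this sketch */

int main(void){
    int a[LEN], b[LEN], c[LEN];
    int i;
    /* fill the inputs with something predictable */
    for (i=0; i<LEN; i++){
        a[i] = i;
        b[i] = 2*i;
    }
    addArr(c, a, b, LEN);                  /* c[i] = a[i] + b[i] */
    printf("sum = %ld\n", addElems(c, LEN));
    return 0;
}

The listings below can then be produced by running a disassembler such as objdump -d on the compiled binary.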
Compiling the main file on an AArch64 system with no optimization flags gives the following disassembly for the first function:
0000000000400660 <addArr>:
400660: d100c3ff sub sp, sp, #0x30 --> Creating Stack
400664: f9000fe0 str x0, [sp,#24] --> Saving p3 pointer
400668: f9000be1 str x1, [sp,#16] --> Saving p1 pointer
40066c: f90007e2 str x2, [sp,#8] --> Saving p2 pointer
400670: b90007e3 str w3, [sp,#4] --> Saving length int
400674: b9002fff str wzr, [sp,#44] ---> assigning zero to i
400678: 14000014 b 4006c8 <addArr+0x68> --->Start of loop
40067c: b9802fe0 ldrsw x0, [sp,#44] ---> load i into r0
400680: d37ef400 lsl x0, x0, #2 --> Quadruples i (int = 4 bytes)
400684: f9400fe1 ldr x1, [sp,#24] --> loads p3 (pointer) into r1
400688: 8b000020 add x0, x1, x0 --> offsets p3 pointer by i, save in r0
40068c: b9802fe1 ldrsw x1, [sp,#44] --> load i into r1 (need to keep r0) (again)
400690: d37ef421 lsl x1, x1, #2 --> quadruples i (again)
400694: f9400be2 ldr x2, [sp,#16] --> load p1 (pointer) into r2
400698: 8b010041 add x1, x2, x1 --> offsets p1 pointer by i, save in r1
40069c: b9400022 ldr w2, [x1] --> load p1 int value into r2
4006a0: b9802fe1 ldrsw x1, [sp,#44] --> load i into r1 (need to keep r0) (again)
4006a4: d37ef421 lsl x1, x1, #2 --> quadruples i (again)
4006a8: f94007e3 ldr x3, [sp,#8] --> load p2 (pointer) into r3
4006ac: 8b010061 add x1, x3, x1 --> offsets p2 pointer by i, save in r1
4006b0: b9400021 ldr w1, [x1] --> load p2 int value into r1
4006b4: 0b010041 add w1, w2, w1 --> finally adds pair of elements, save in r1
4006b8: b9000001 str w1, [x0] --> store into p3 array
4006bc: b9402fe0 ldr w0, [sp,#44] \
4006c0: 11000400 add w0, w0, #0x1 --> Inc i
4006c4: b9002fe0 str w0, [sp,#44] /
4006c8: b9402fe1 ldr w1, [sp,#44] ---> loads i into r1
4006cc: b94007e0 ldr w0, [sp,#4] ---> loads length int into r0
4006d0: 6b00003f cmp w1, w0 ---> i - length
4006d4: 54fffd4b b.lt 40067c <addArr+0x1c> ---> if negative, loop again
4006d8: 9100c3ff add sp, sp, #0x30 ---> restores stack pointer
4006dc: d65f03c0 ret ---> return to caller
As we can see from the code, no vectorization is happening: the function performs a single addition per loop iteration. We also notice that the compiler uses registers very conservatively. Every value is spilled to the stack in RAM, and nothing beyond the first four registers is ever used, in order to "play it safe" with respect to other calls in the program. Going through RAM is extremely slow compared to keeping values in registers.
Now using the '-O1' compiler option:
0000000000400660 <addArr>:
400660: 6b1f007f cmp w3, wzr --> w3 originally has 'length'
400664: 5400018d b.le 400694 <addArr+0x34>
400668: 51000466 sub w6, w3, #0x1
40066c: 910004c6 add x6, x6, #0x1
400670: d37ef4c6 lsl x6, x6, #2
400674: d2800003 mov x3, #0x0 // #0
400678: b8636825 ldr w5, [x1,x3]
40067c: b8636844 ldr w4, [x2,x3] --> r1 and r2 have the pointers to p1 and p2
400680: 0b0400a4 add w4, w5, w4
400684: b8236804 str w4, [x0,x3] --> r0 has pointer to p3
400688: 91001063 add x3, x3, #0x4
40068c: eb06007f cmp x3, x6
400690: 54ffff41 b.ne 400678 <addArr+0x18>
400694: d65f03c0 ret
We can see that it computes the results much more directly, barely touches RAM, and does not allocate a stack frame at all. Still, the compiler does not vectorize the code, because auto-vectorization is not enabled at this level. To enable it, we can either use -O3, which bundles many optimization flags, or simply add the single '-ftree-vectorize' flag when compiling.
Let's try adding just that single flag on top of -O1:
0000000000400660 <addArr>:
400660: 6b1f007f cmp w3, wzr
400664: 540007ad b.le 400758 <addArr+0xf8>
400668: 91004004 add x4, x0, #0x10
40066c: eb04003f cmp x1, x4
400670: 1a9f37e6 cset w6, cs
400674: 91004025 add x5, x1, #0x10
400678: eb05001f cmp x0, x5
40067c: 1a9f37e5 cset w5, cs
400680: 2a0500c5 orr w5, w6, w5
400684: eb04005f cmp x2, x4
400688: 1a9f37e6 cset w6, cs
40068c: 91004044 add x4, x2, #0x10
400690: eb04001f cmp x0, x4
400694: 1a9f37e4 cset w4, cs
400698: 2a0400c4 orr w4, w6, w4
40069c: 6a0400bf tst w5, w4
4006a0: 54000460 b.eq 40072c <addArr+0xcc>
4006a4: 71000c7f cmp w3, #0x3
4006a8: 54000429 b.ls 40072c <addArr+0xcc>
4006ac: 53027c66 lsr w6, w3, #2
4006b0: 531e74c7 lsl w7, w6, #2
4006b4: 34000387 cbz w7, 400724 <addArr+0xc4>
4006b8: d2800004 mov x4, #0x0 // #0
4006bc: 2a0403e5 mov w5, w4
4006c0: 8b040028 add x8, x1, x4
4006c4: 4c407901 ld1 {v1.4s}, [x8]
4006c8: 8b040048 add x8, x2, x4
4006cc: 4c407900 ld1 {v0.4s}, [x8]
4006d0: 4ea08420 add v0.4s, v1.4s, v0.4s --> Vectorization!
4006d4: 8b040008 add x8, x0, x4
4006d8: 4c007900 st1 {v0.4s}, [x8]
4006dc: 110004a5 add w5, w5, #0x1
4006e0: 91004084 add x4, x4, #0x10
4006e4: 6b0600bf cmp w5, w6
4006e8: 54fffec3 b.cc 4006c0 <addArr+0x60>
4006ec: 1400000a b 400714 <addArr+0xb4>
4006f0: 937e7c85 sbfiz x5, x4, #2, #32
4006f4: b8656827 ldr w7, [x1,x5]
4006f8: b8656846 ldr w6, [x2,x5]
4006fc: 0b0600e6 add w6, w7, w6
400700: b8256806 str w6, [x0,x5]
400704: 11000484 add w4, w4, #0x1
400708: 6b04007f cmp w3, w4
40070c: 54ffff2c b.gt 4006f0 <addArr+0x90>
400710: 14000012 b 400758 <addArr+0xf8>
400714: 2a0703e4 mov w4, w7
400718: 6b07007f cmp w3, w7
40071c: 54fffea1 b.ne 4006f0 <addArr+0x90>
400720: 1400000e b 400758 <addArr+0xf8>
400724: 52800004 mov w4, #0x0 // #0
400728: 17fffff2 b 4006f0 <addArr+0x90>
40072c: 51000466 sub w6, w3, #0x1
400730: 910004c6 add x6, x6, #0x1
400734: d37ef4c6 lsl x6, x6, #2
400738: d2800003 mov x3, #0x0 // #0
40073c: b8636825 ldr w5, [x1,x3]
400740: b8636844 ldr w4, [x2,x3]
400744: 0b0400a4 add w4, w5, w4
400748: b8236804 str w4, [x0,x3]
40074c: 91001063 add x3, x3, #0x4
400750: eb06007f cmp x3, x6
400754: 54ffff41 b.ne 40073c <addArr+0xdc>
400758: d65f03c0 ret
We finally get vectorized operations! The function does perform an awful lot of comparisons up front, though, checking at runtime that the arrays do not overlap and that there are enough elements to fill a vector, which is what balloons the listing to 64 lines.
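Roughly, the guard at the top of the function amounts to the following logic; this is only a C sketch of what the generated checks do, not code the compiler actually emits:

/* Sketch of the runtime guards: take the SIMD path only if neither
   input overlaps the output within one 16-byte vector and there are
   at least 4 elements to process. */
if ((p1 >= p3 + 4 || p3 >= p1 + 4) &&    /* p1 vs p3: no overlap within 16 bytes */
    (p2 >= p3 + 4 || p3 >= p2 + 4) &&    /* p2 vs p3: no overlap within 16 bytes */
    length > 3){
    /* vectorized loop: 4 ints per iteration */
} else {
    /* scalar fallback loop */
}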
Since the compiler and memory allocator keep data suitably aligned on most modern systems, and since we know our arrays never overlap, we can tell the compiler to assume both of these facts. Following some techniques in this article, our function now looks like this:
void addArr(int *__restrict p3, int *__restrict p1, int *__restrict p2, int length){
    int i;
    int *arr3 = __builtin_assume_aligned(p3, 16);
    int *arr2 = __builtin_assume_aligned(p2, 16);
    int *arr1 = __builtin_assume_aligned(p1, 16);
    for(i=0; i<length; i++)
        arr3[i] = arr1[i] + arr2[i];
}
The '__builtin_assume_aligned' built-in hints to the compiler that the pointers are at least 16-byte aligned, and the '__restrict' keyword tells it that the pointers refer to non-overlapping regions of memory at runtime. Together these allow the compiler to skip the runtime overlap checks, resulting in this shorter code:
0000000000400660 <addArr>:
400660: 6b1f007f cmp w3, wzr
400664: 5400046d b.le 4006f0 <addArr+0x90>
400668: 53027c66 lsr w6, w3, #2
40066c: 531e74c5 lsl w5, w6, #2
400670: 340003c5 cbz w5, 4006e8 <addArr+0x88>
400674: 71000c7f cmp w3, #0x3
400678: 54000389 b.ls 4006e8 <addArr+0x88>
40067c: d2800004 mov x4, #0x0 // #0
400680: 2a0403e7 mov w7, w4
400684: 8b040008 add x8, x0, x4
400688: 8b04002a add x10, x1, x4
40068c: 8b040049 add x9, x2, x4
400690: 4c407941 ld1 {v1.4s}, [x10]
400694: 4c407920 ld1 {v0.4s}, [x9]
400698: 4ea08420 add v0.4s, v1.4s, v0.4s ---> Still vectorizing!
40069c: 4c007900 st1 {v0.4s}, [x8]
4006a0: 110004e7 add w7, w7, #0x1
4006a4: 91004084 add x4, x4, #0x10
4006a8: 6b0700df cmp w6, w7
4006ac: 54fffec8 b.hi 400684 <addArr+0x24>
4006b0: 1400000a b 4006d8 <addArr+0x78>
4006b4: 937e7c85 sbfiz x5, x4, #2, #32
4006b8: b8656827 ldr w7, [x1,x5]
4006bc: b8656846 ldr w6, [x2,x5]
4006c0: 0b0600e6 add w6, w7, w6
4006c4: b8256806 str w6, [x0,x5]
4006c8: 11000484 add w4, w4, #0x1
4006cc: 6b04007f cmp w3, w4
4006d0: 54ffff2c b.gt 4006b4 <addArr+0x54>
4006d4: 14000007 b 4006f0 <addArr+0x90>
4006d8: 2a0503e4 mov w4, w5
4006dc: 6b05007f cmp w3, w5
4006e0: 54fffea1 b.ne 4006b4 <addArr+0x54>
4006e4: 14000003 b 4006f0 <addArr+0x90>
4006e8: 52800004 mov w4, #0x0 // #0
4006ec: 17fffff2 b 4006b4 <addArr+0x54>
4006f0: d65f03c0 ret
Applying the same techniques to the 'addElems' function:
long addElems(int *__restrict p_arr, int length){
    int i;
    long sum = 0;
    int *arr = __builtin_assume_aligned(p_arr, 16);
    for (i=0; i<length; i++)
        sum += arr[i];
    return sum;
}
And we also get vectorized, relatively short assembly for it.
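Conceptually, the compiler vectorizes this reduction by keeping several partial sums in a SIMD register and combining them at the end. The lane count and details below are only illustrative, a sketch of the idea rather than the generated code:

/* Illustrative 4-way unrolled reduction: the vectorized version keeps
   these partial sums in SIMD lanes instead of separate variables. */
long partial0 = 0, partial1 = 0, partial2 = 0, partial3 = 0;
int i;
for (i = 0; i + 4 <= length; i += 4){
    partial0 += arr[i];
    partial1 += arr[i+1];
    partial2 += arr[i+2];
    partial3 += arr[i+3];
}
long sum = partial0 + partial1 + partial2 + partial3;
for (; i < length; i++)    /* leftover elements handled one at a time */
    sum += arr[i];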
Conclusion
As GCC supports many different processors and architectures, it has to stay conservative so that the code it generates remains correct on all of them. This means a lot of optimizations do not take place automatically, and much is left up to the programmer to figure out. Vectorization can grant huge performance boosts, yet a myriad of existing programs never take advantage of it, sometimes opting for multi-threaded solutions instead. There is, then, a great deal of untapped potential in vectorizing this software, and exploiting it might in turn push compilers and CPU manufacturers toward better auto-vectorization support.