Thursday, March 30, 2017

Lab 7 - Brief LAME assembler usage

LAME is an MP3 encoder, despite its recursive acronym standing for 'LAME Ain't an MP3 Encoder'. It describes itself as an educational tool for learning about MP3 encoding, with the goal of improving the quality and performance of the format. Interestingly, LAME is distributed by the developers only as source code, with the stated purpose of being used merely as an educational tool, which in turn avoids patent infringement. Still, the developers advise acquiring a license before deploying projects that incorporate any part of compiled LAME code.



MP3

MP3 (aka MPEG-1/MPEG-2 Audio Layer III) is a digital audio format that reduces the size of the original audio file by using lossy compression algorithms. The format takes advantage of the limits of human sound perception (the psychoacoustic model), both in frequency range and in frequency resolution, i.e. our ability to notice changes in frequency. The compression discards or reduces the components we can't hear.

The MPEG-1 standard provides example psychoacoustic models among other specifications, but it doesn't precisely specify the encoder, leaving developers to create their own algorithms for removing undesired information from a sound file. This led to the creation of many MP3 encoders, each generating output of different quality. For example, some encoders were better at dealing with low bit-rate files, while others (e.g. LAME) were better at higher bit-rates. Eventually, LAME evolved and became widely used.

Encoding

Encoding MP3 involves heavy frequency manipulation over a large number of samples, and ideally those samples should be processed as fast as possible. In software, this can be done by implementing algorithms like the fast Fourier transform (FFT), which converts a signal represented in the time domain to the frequency domain, allowing a much more convenient way to analyse and manipulate the data. Once there is a concrete theoretical basis that works with discrete data, the next level of optimization is to exploit the hardware features available to reduce processing time. LAME is able to make use of such hardware features to optimize encoding.
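To illustrate the transform itself, here's a naive discrete Fourier transform sketch (a minimal example of my own; LAME's actual FFT code is far more optimized):

#include <math.h>
#include <complex.h>

// Naive DFT: converts N time-domain samples x[] into N frequency
// bins X[]. An FFT produces the same result in O(N log N) time
// instead of this O(N^2) double loop. Compile with -lm.
void dft(const double *x, double complex *X, int N){
    for (int k = 0; k < N; k++){       // each frequency bin
        X[k] = 0;
        for (int n = 0; n < N; n++)    // each time-domain sample
            X[k] += x[n] * cexp(-2.0 * I * M_PI * k * n / N);
    }
}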

Assembly Usage

LAME uses assembly extensively. It specifically targets the x86 architecture family and can make use of features such as MMX, SSE, SSE2 and '3DNow!'. Although there's very little inline assembly, there are various nasm files that define functions which are then called from the C code.

The assembly code is used for checking CPU features, performing fast Fourier transforms using whichever of those features is available, and for other discrete signal-processing calculations.
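LAME's feature checks are hand-written nasm routines, but the general idea can be sketched with GCC's builtin (a minimal x86-only example of my own, not LAME code):

#include <stdio.h>

// Report which SIMD feature sets the running CPU supports,
// using GCC's __builtin_cpu_supports (x86 targets only).
int main(void){
    if (__builtin_cpu_supports("mmx"))  puts("MMX available");
    if (__builtin_cpu_supports("sse"))  puts("SSE available");
    if (__builtin_cpu_supports("sse2")) puts("SSE2 available");
    return 0;
}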

I wasn't able to compile the source for Aarch64.

Conclusion

LAME has been in development for a long time and it's widely used on many operating systems. Their website says: "LAME compiles on Windows, DOS, GNU/Linux, MacOS X, *BSD, Solaris, HP-UX, Tru64 Unix, AIX, Irix, NeXTstep, SCO Unix, UnixWare, Ultrix, Plan 9, OpenVMS, MacOS Classic, BeOS, QNX, RiscOS, AmigaOS, OS/2, SkyOS, FreeMiNT(Atari) and probably a few more". It is truly a project that has been optimized and ported a lot, which benefits the user, but it immensely increases the complexity of compiling the code. Furthermore, the assembly code limits the program's optimizations to specific platforms, even when other platforms offer equivalent features.

Lab 7 - Inline A64 Assembler

This lab will explore the benefits of using inline assembly, specifically on Aarch64. If writing assembly were as easy as writing C (and for some people it is), C would be obsolete. Thankfully, C provides abstraction so we don't have to deal with processor-specific instructions, and it makes code portable between processors, but there are times when it just can't make full use of the processor features that could significantly improve the performance of a certain process, at least not without a lot of tinkering to make the compiler understand that it should use those features. The most straightforward way to use those features, while keeping the abstraction of C, is inline assembly, which embeds architecture-specific assembly code within the C source.

To demonstrate inline assembly, I'll be optimizing the code from lab 5, where 500 million 16-bit sound samples were modified by a volume scaling factor. The final solution I had come up with was the following:


 float vol = 0.333;  
   
 // converting float into uint16_t: 1 whole, 15 decimals  
 uint16_t vol16 = (int)(vol * pow(2,15));  
   
 ///// main process block /////  
 gettimeofday(&t1, NULL);  
   
 for (i = 0; i < INPUT_NUM ; i++)   
   input[i] = ((input[i] * vol16) >> 15);  
   
 gettimeofday(&t2, NULL);  
 ///////////////////////////////  

This solution involves converting the float volume scaling factor into a fixed-point format and then performing the multiplication. The result of compiling without any options, for 500 million samples, was very close to 4 seconds (3959 ms); compiling with the -O3 option cut the time down to 362 ms.

Implementing inline assembly

I decided to create a more general function to introduce the inline assembly into the code. The code block above turned into this:

 float vol = 0.333;  
   
 ///// main process block /////  
 gettimeofday(&t1, NULL);  
   
 adjustVolume(input, vol, INPUT_NUM);  
   
 gettimeofday(&t2, NULL);  
 ///////////////////////////////  

And the final function is as follows:

 void adjustVolume(int16_t* samples, float vsf, unsigned long size){

   unsigned long nli;
   int samples_left;
   int16_t* i;

   //convert float to UQ1.15 fixed point
   uint16_t vol16 = (int)(vsf * pow(2,15));

   //see if there are leftover values (we process 8 samples at a time)
   samples_left = size % 8;

   //loop until just before the first index that can't fill a vector
   nli = size - samples_left;
   int16_t* p_nli = &samples[nli];

   for(i = samples; i < p_nli ; i+=8){ //8 samples per iteration

     __asm__("DUP v1.8h, %w[volume]\n\t"       //load volume into all 8 lanes of v1
         "LD1 {v0.8h}, [%[smp_ptr]]\n\t"       //load 8 16-bit samples into v0
         "SQDMULH v0.8h, v0.8h, v1.8h\n\t"     //adjust 8 samples at a time; save in v0
         "ST1 {v0.8h}, [%[smp_ptr]]\n\t"       //store back into the samples array

         : //no outputs; results are written straight to memory

         : [volume] "r" (vol16), [smp_ptr] "r" (i) //inputs

         : "v0", "v1", "memory" //clobbers: the vectors we overwrite and the memory stored to

         );

   }

   //adjust the leftover samples (up to 7) one at a time
   while(samples_left){

     samples[size-samples_left] = ((samples[size-samples_left] * vol16) >> 15);

     samples_left--;

   }

 }

The function starts by converting the volume scaling factor from float to fixed point, making the value more convenient to load. Getting into the assembly, we pass in the converted 16-bit volume and the pointer into the array, which advances by 8 16-bit elements each iteration. DUP and LD1 load the volume and the samples into separate vector registers. SQDMULH multiplies the corresponding 16-bit lanes, doubling each 32-bit product, and overwrites the samples vector with the high half of each result. ST1 then stores the adjusted samples back into the array in memory.
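To see why SQDMULH lines up with the '>> 15' in the scalar fallback: keeping the high 16 bits of a doubled 32-bit product, (2ab) >> 16, is the same as (ab) >> 15. Here's a scalar model of what each lane does (my own sketch, not code from the lab):

#include <stdint.h>

// Scalar model of one SQDMULH lane: signed saturating doubling
// multiply, returning the high half of the 32-bit result.
static int16_t sqdmulh16(int16_t a, int16_t b){
    int64_t p = 2 * (int64_t)a * (int64_t)b; // doubling multiply
    if (p > INT32_MAX) p = INT32_MAX;        // saturates only for -32768 * -32768
    return (int16_t)(p >> 16);               // high half, equal to (a*b) >> 15
}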

To have a better idea of what is going on, here's a visualization using gdb:


Commands:
  • ni: next instruction
  • x/h $[register]: show contents in memory (x) pointed by value in register, format as half-word (h)
  • p $[register]: print value in register. In the case above, only showing half-word format
At the top, we can see that the inline assembly remained intact through compilation. With the commands, we can see a value of the array in memory and then the same value printed as a signed integer in the vector register v0. v1 holds the fixed-point volume scaling factor (10911), which is (close to) equivalent to 0.333. After executing the multiplication, we see that v0 now holds the adjusted values, and one instruction further, those are stored back in memory.

Wrapping the assembly is a loop controlled by an int16_t pointer that skips ahead 8 elements per iteration and exits when there aren't 8 more elements to load. The function checks for leftovers beforehand, to minimize comparisons inside the loop. The worst case is size % 8 == 7 (e.g. size = 500 000 007), and those last samples are individually adjusted after the main loop.

Results!

Finally, this section compares the results from lab 5's implementation (simply using fixed point on 500 million samples) to this lab's implementation (using inline assembly on 500 000 007 samples, just to get the worst possible case, although those extra samples make almost no difference to the result).


So, a huge improvement using the inline assembly with no compiler options, although no improvement over the optimized compilation. This is probably due to the extra overhead of the function call, and maybe those extra samples made a dent after all, but it was to be expected, since -O3 enables auto-vectorization. Still, compiling the inline assembly code with -O1 already gives results very close to the -O3 option (372 ms), while the older version at -O1 gives results around 760 ms.

Conclusion

Using inline assembly allows us to make bare-metal optimizations for the parts where performance really matters. For simple operations there's no real need for assembly, so we can enjoy the abstraction C provides for most tasks, and hand the heavy lifting to assembly when necessary.

Another benefit is that we can optimize code without needing to submit all of it to compiler options, which can generate unexpected results (especially when multi-threading).

Wednesday, March 22, 2017

Lab 6 - Auto-Vectorization with gcc

This is an exercise on auto-vectorization. The purpose is to analyse the compilation of a simple source file through its disassembly, and to modify the source to help the compiler produce a more efficient binary.

The source was created by following these guidelines. The following functions live in the main file; they are responsible for adding pairs of array elements into a third array, and for summing all the elements of an array into a single value, respectively.


//adds pairs of array elements
void addArr(int* p3, int* p1, int* p2, int length){

    int i;
    
    for(i=0; i<length; i++)
        p3[i] = p1[i] + p2[i];
        
}

//adds elements of an int array
long addElems(int *arr , int length){

    int i;
    long sum = 0;

    for (i=0; i<length; i++)
        sum += arr[i];

    return sum;
}


By compiling the main file on an Aarch64 system with no compiler options, we get the following disassembly for the first function:


0000000000400660 <addArr>:
  400660:       d100c3ff        sub     sp, sp, #0x30 --> Creating Stack
  400664:       f9000fe0        str     x0, [sp,#24] --> Saving p3 pointer
  400668:       f9000be1        str     x1, [sp,#16] --> Saving p1 pointer
  40066c:       f90007e2        str     x2, [sp,#8] --> Saving p2 pointer
  400670:       b90007e3        str     w3, [sp,#4] --> Saving length int
  400674:       b9002fff        str     wzr, [sp,#44] ---> assigning zero to i 
  400678:       14000014        b       4006c8 <addArr+0x68> --->Start of loop
  40067c:       b9802fe0        ldrsw   x0, [sp,#44] ---> load i into r0
  400680:       d37ef400        lsl     x0, x0, #2 --> Quadruples i (int = 4 bytes)
  400684:       f9400fe1        ldr     x1, [sp,#24] --> loads p3 (pointer) into r1
  400688:       8b000020        add     x0, x1, x0 --> offsets p3 pointer by i, save in r0
  40068c:       b9802fe1        ldrsw   x1, [sp,#44] --> load i into r1 (need to keep r0) (again)
  400690:       d37ef421        lsl     x1, x1, #2 --> quadruples i (again)
  400694:       f9400be2        ldr     x2, [sp,#16] --> load p1 (pointer) into r2
  400698:       8b010041        add     x1, x2, x1 --> offsets p1 pointer by i, save in r1
  40069c:       b9400022        ldr     w2, [x1] --> load p1 int value into r2
  4006a0:       b9802fe1        ldrsw   x1, [sp,#44] --> load i into r1 (need to keep r0) (again)
  4006a4:       d37ef421        lsl     x1, x1, #2 --> quadruples i (again)
  4006a8:       f94007e3        ldr     x3, [sp,#8] --> load p2 (pointer) into r3
  4006ac:       8b010061        add     x1, x3, x1 --> offsets p2 pointer by i, save in r1
  4006b0:       b9400021        ldr     w1, [x1] --> load p2 int value into r1
  4006b4:       0b010041        add     w1, w2, w1 --> finally adds pair of elements, save in r1
  4006b8:       b9000001        str     w1, [x0] --> store into p3 array
  4006bc:       b9402fe0        ldr     w0, [sp,#44] \
  4006c0:       11000400        add     w0, w0, #0x1  --> Inc i
  4006c4:       b9002fe0        str     w0, [sp,#44] /
  4006c8:       b9402fe1        ldr     w1, [sp,#44] ---> loads i into r1
  4006cc:       b94007e0        ldr     w0, [sp,#4] ---> loads length int into r0
  4006d0:       6b00003f        cmp     w1, w0  ---> i - length
  4006d4:       54fffd4b        b.lt    40067c <addArr+0x1c> ---> if negative, loop again
  4006d8:       9100c3ff        add     sp, sp, #0x30 ---> restores stack pointer
  4006dc:       d65f03c0        ret   ---> return to caller

As we can see from the code, no vectorization is happening. The function is simply producing one addition per loop iteration. We also notice that the program uses registers very conservatively: it stores all values in RAM and never even touches the r4 register, in order to "play it safe" in relation to other calls in the program. Going through RAM is extremely slow compared to keeping values in registers.

Now using the '-O1' compiler option:

0000000000400660 <addArr>:
  400660:       6b1f007f        cmp     w3, wzr  --> w3 originally has 'length'
  400664:       5400018d        b.le    400694 <addArr+0x34>
  400668:       51000466        sub     w6, w3, #0x1
  40066c:       910004c6        add     x6, x6, #0x1
  400670:       d37ef4c6        lsl     x6, x6, #2
  400674:       d2800003        mov     x3, #0x0                        // #0
  400678:       b8636825        ldr     w5, [x1,x3]
  40067c:       b8636844        ldr     w4, [x2,x3] --> r1 and r2 have the pointers to p1 and p2
  400680:       0b0400a4        add     w4, w5, w4
  400684:       b8236804        str     w4, [x0,x3] --> r0 has pointer to p3
  400688:       91001063        add     x3, x3, #0x4
  40068c:       eb06007f        cmp     x3, x6
  400690:       54ffff41        b.ne    400678 <addArr+0x18>
  400694:       d65f03c0        ret


We can see that it processes the results in a more direct way and barely touches RAM, without allocating any stack. Still, the compiler won't vectorize the code, as auto-vectorization is not enabled. In order to enable it, we either use -O3, which bundles various flags, or just add the single '-ftree-vectorize' flag when compiling.
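For reference, the build command would look something like this (the source file name is assumed here):

gcc -O1 -ftree-vectorize -o main main.c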

Let's just try using the single flag, while on -O1:


0000000000400660 <addArr>:
  400660: 6b1f007f  cmp w3, wzr
  400664: 540007ad  b.le 400758 <addArr+0xf8>
  400668: 91004004  add x4, x0, #0x10
  40066c: eb04003f  cmp x1, x4
  400670: 1a9f37e6  cset w6, cs
  400674: 91004025  add x5, x1, #0x10
  400678: eb05001f  cmp x0, x5
  40067c: 1a9f37e5  cset w5, cs
  400680: 2a0500c5  orr w5, w6, w5
  400684: eb04005f  cmp x2, x4
  400688: 1a9f37e6  cset w6, cs
  40068c: 91004044  add x4, x2, #0x10
  400690: eb04001f  cmp x0, x4
  400694: 1a9f37e4  cset w4, cs
  400698: 2a0400c4  orr w4, w6, w4
  40069c: 6a0400bf  tst w5, w4
  4006a0: 54000460  b.eq 40072c <addArr+0xcc>
  4006a4: 71000c7f  cmp w3, #0x3
  4006a8: 54000429  b.ls 40072c <addArr+0xcc>
  4006ac: 53027c66  lsr w6, w3, #2
  4006b0: 531e74c7  lsl w7, w6, #2
  4006b4: 34000387  cbz w7, 400724 <addArr+0xc4>
  4006b8: d2800004  mov x4, #0x0                    // #0
  4006bc: 2a0403e5  mov w5, w4
  4006c0: 8b040028  add x8, x1, x4
  4006c4: 4c407901  ld1 {v1.4s}, [x8]
  4006c8: 8b040048  add x8, x2, x4
  4006cc: 4c407900  ld1 {v0.4s}, [x8]
  4006d0: 4ea08420  add v0.4s, v1.4s, v0.4s --> Vectorization!
  4006d4: 8b040008  add x8, x0, x4
  4006d8: 4c007900  st1 {v0.4s}, [x8]
  4006dc: 110004a5  add w5, w5, #0x1
  4006e0: 91004084  add x4, x4, #0x10
  4006e4: 6b0600bf  cmp w5, w6
  4006e8: 54fffec3  b.cc 4006c0 <addArr+0x60>
  4006ec: 1400000a  b 400714 <addArr+0xb4>
  4006f0: 937e7c85  sbfiz x5, x4, #2, #32
  4006f4: b8656827  ldr w7, [x1,x5]
  4006f8: b8656846  ldr w6, [x2,x5]
  4006fc: 0b0600e6  add w6, w7, w6
  400700: b8256806  str w6, [x0,x5]
  400704: 11000484  add w4, w4, #0x1
  400708: 6b04007f  cmp w3, w4
  40070c: 54ffff2c  b.gt 4006f0 <addArr+0x90>
  400710: 14000012  b 400758 <addArr+0xf8>
  400714: 2a0703e4  mov w4, w7
  400718: 6b07007f  cmp w3, w7
  40071c: 54fffea1  b.ne 4006f0 <addArr+0x90>
  400720: 1400000e  b 400758 <addArr+0xf8>
  400724: 52800004  mov w4, #0x0                    // #0
  400728: 17fffff2  b 4006f0 <addArr+0x90>
  40072c: 51000466  sub w6, w3, #0x1
  400730: 910004c6  add x6, x6, #0x1
  400734: d37ef4c6  lsl x6, x6, #2
  400738: d2800003  mov x3, #0x0                    // #0
  40073c: b8636825  ldr w5, [x1,x3]
  400740: b8636844  ldr w4, [x2,x3]
  400744: 0b0400a4  add w4, w5, w4
  400748: b8236804  str w4, [x0,x3]
  40074c: 91001063  add x3, x3, #0x4
  400750: eb06007f  cmp x3, x6
  400754: 54ffff41  b.ne 40073c <addArr+0xdc>
  400758: d65f03c0  ret


We finally get vectorized operations! Although the function now does an awful lot of comparisons up front, probably checking that the pointers don't overlap and that a full vector's worth of elements remains, resulting in this 64-line listing.
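Conceptually, the prologue amounts to something like the following C model of the generated checks (my own rough rendering for illustration, not actual gcc source or output):

#include <stdbool.h>

// Rough model of the runtime checks in the prologue: the vector path
// is taken only when the 16 bytes (4 ints) written through p3 overlap
// neither source array and at least 4 elements remain.
// (Pointer comparisons across arrays are shown purely for illustration.)
static bool can_vectorize(const int *p3, const int *p1,
                          const int *p2, int length){
    bool p1_ok = (p1 >= p3 + 4) || (p3 >= p1 + 4);
    bool p2_ok = (p2 >= p3 + 4) || (p3 >= p2 + 4);
    return p1_ok && p2_ok && length > 3;
}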

As compilers and allocators generally align data anyway, we can pretty much tell the compiler to assume that the arrays are, in fact, aligned, and that the pointers don't overlap. Following some techniques in this article, our function now looks like this:


void addArr(int *__restrict p3, int *__restrict p1, int *__restrict p2, int length){
    int i;
    int *arr3 = __builtin_assume_aligned(p3, 16);
    int *arr2 = __builtin_assume_aligned(p2, 16);
    int *arr1 = __builtin_assume_aligned(p1, 16);

    for(i=0; i<length; i++)
        arr3[i] = arr1[i] + arr2[i];
}

The '__builtin_assume_aligned' function hints to the compiler that the values are at least 16-byte aligned, and the '__restrict' keyword tells it that the pointers refer to non-overlapping regions of memory at runtime. All this allows the compiler to skip the runtime checks, resulting in this shorter code:


0000000000400660 <addArr>:
  400660: 6b1f007f  cmp w3, wzr
  400664: 5400046d  b.le 4006f0 <addArr+0x90>
  400668: 53027c66  lsr w6, w3, #2
  40066c: 531e74c5  lsl w5, w6, #2
  400670: 340003c5  cbz w5, 4006e8 <addArr+0x88>
  400674: 71000c7f  cmp w3, #0x3
  400678: 54000389  b.ls 4006e8 <addArr+0x88>
  40067c: d2800004  mov x4, #0x0                    // #0
  400680: 2a0403e7  mov w7, w4
  400684: 8b040008  add x8, x0, x4
  400688: 8b04002a  add x10, x1, x4
  40068c: 8b040049  add x9, x2, x4
  400690: 4c407941  ld1 {v1.4s}, [x10]
  400694: 4c407920  ld1 {v0.4s}, [x9]
  400698: 4ea08420  add v0.4s, v1.4s, v0.4s    ---> Still vectorizing!
  40069c: 4c007900  st1 {v0.4s}, [x8]
  4006a0: 110004e7  add w7, w7, #0x1
  4006a4: 91004084  add x4, x4, #0x10
  4006a8: 6b0700df  cmp w6, w7
  4006ac: 54fffec8  b.hi 400684 <addArr+0x24>
  4006b0: 1400000a  b 4006d8 <addArr+0x78>
  4006b4: 937e7c85  sbfiz x5, x4, #2, #32
  4006b8: b8656827  ldr w7, [x1,x5]
  4006bc: b8656846  ldr w6, [x2,x5]
  4006c0: 0b0600e6  add w6, w7, w6
  4006c4: b8256806  str w6, [x0,x5]
  4006c8: 11000484  add w4, w4, #0x1
  4006cc: 6b04007f  cmp w3, w4
  4006d0: 54ffff2c  b.gt 4006b4 <addArr+0x54>
  4006d4: 14000007  b 4006f0 <addArr+0x90>
  4006d8: 2a0503e4  mov w4, w5
  4006dc: 6b05007f  cmp w3, w5
  4006e0: 54fffea1  b.ne 4006b4 <addArr+0x54>
  4006e4: 14000003  b 4006f0 <addArr+0x90>
  4006e8: 52800004  mov w4, #0x0                    // #0
  4006ec: 17fffff2  b 4006b4 <addArr+0x54>
  4006f0: d65f03c0  ret
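One caveat: '__builtin_assume_aligned' is a promise rather than a check, so the caller has to actually provide 16-byte-aligned buffers. Here's a minimal sketch of one way to do that (the allocation scheme is my own assumption, not part of the lab code):

#include <stdlib.h>
#include <stdio.h>

int main(void){
    int *p1, *p2, *p3;
    size_t n = 1024;

    // posix_memalign guarantees the 16-byte alignment that
    // __builtin_assume_aligned promises to the compiler
    if (posix_memalign((void **)&p1, 16, n * sizeof(int)) ||
        posix_memalign((void **)&p2, 16, n * sizeof(int)) ||
        posix_memalign((void **)&p3, 16, n * sizeof(int))){
        perror("posix_memalign");
        return 1;
    }

    // ... fill p1 and p2, then: addArr(p3, p1, p2, n); ...

    free(p1); free(p2); free(p3);
    return 0;
}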

Applying the same techniques to the 'addElems' function:


long addElems(int *__restrict p_arr , int length){

    int i;
    long sum = 0;
    int *arr = __builtin_assume_aligned(p_arr, 16);

    for (i=0; i<length; i++)
        sum += arr[i];

    return sum;
}

And we also get vectorized, relatively short assembly.

Conclusion

As GCC supports many different processors and architectures, it needs to maintain stable behaviour across these platforms. As a result, a lot of optimizations don't take place automatically, and much is left up to the programmer to figure out.

Vectorization can grant huge performance boosts, and there is a myriad of existing software that doesn't take advantage of it, sometimes opting for multi-threaded solutions instead. With that said, there is great untapped potential to improve this software with vectorization, and that might lead to better auto-vectorization support from compilers and CPU manufacturers.


Saturday, March 4, 2017

Lab 5 - Compiler Optimization and Algorithm Selection - Part 2


Using floating point for the volume scaling factor


In our original example, we were using a floating point value to represent the volume scaling factor and multiplying each 16-bit integer sample by it. In a higher level language such as C, getting the result seems obvious: simply multiply both values and extract the whole number part. Let's see how the computer actually perceives it.



        /////  main process block /////
        gettimeofday(&t1, NULL);
  400c51:       be 00 00 00 00          mov    $0x0,%esi
  400c56:       bf 90 20 60 00          mov    $0x602090,%edi
  400c5b:       e8 20 fb ff ff          callq  400780 

        for (i = 0; i < INPUT_NUM ; i++) {
  400c60:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
  400c67:       00 
  400c68:       eb 4b                   jmp    400cb5 
            input[i] *= 0.5;
  400c6a:       48 8b 45 a0             mov    -0x60(%rbp),%rax
  400c6e:       48 8d 14 00             lea    (%rax,%rax,1),%rdx
  400c72:       48 8b 45 d0             mov    -0x30(%rbp),%rax
  400c76:       48 01 d0                add    %rdx,%rax
  400c79:       48 8b 55 a0             mov    -0x60(%rbp),%rdx
  400c7d:       48 8d 0c 12             lea    (%rdx,%rdx,1),%rcx
  400c81:       48 8b 55 d0             mov    -0x30(%rbp),%rdx
  400c85:       48 01 ca                add    %rcx,%rdx
  400c88:       0f b7 12                movzwl (%rdx),%edx
  400c8b:       0f bf d2                movswl %dx,%edx
  400c8e:       66 0f ef c0             pxor   %xmm0,%xmm0
  400c92:       f2 0f 2a c2             cvtsi2sd %edx,%xmm0
  400c96:       f2 0f 10 0d 7a 04 00    movsd  0x47a(%rip),%xmm1        # 401118 <__dso_handle+0x140>
  400c9d:       00
  400c9e:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
  400ca2:       f2 0f 2c d0             cvttsd2si %xmm0,%edx
  400ca6:       66 89 10                mov    %dx,(%rax)

        /////  main process block /////
        gettimeofday(&t1, NULL);

        for (i = 0; i < INPUT_NUM ; i++) {
  400ca9:       48 8b 45 a0             mov    -0x60(%rbp),%rax
  400cad:       48 83 c0 01             add    $0x1,%rax
  400cb1:       48 89 45 a0             mov    %rax,-0x60(%rbp)
  400cb5:       48 8b 45 a0             mov    -0x60(%rbp),%rax
  400cb9:       48 3d ff 64 cd 1d       cmp    $0x1dcd64ff,%rax
  400cbf:       76 a9                   jbe    400c6a
                //input[i] >>= 1;
                input[i] *= 0.5;
        }
        gettimeofday(&t2, NULL);
  400cc1:       be 00 00 00 00          mov    $0x0,%esi
  400cc6:       bf b0 20 60 00          mov    $0x6020b0,%edi
  400ccb:       e8 b0 fa ff ff          callq  400780
        ///////////////////////////////

That's a lot of instructions for one simple loop!

When entering the for loop, the CPU first loads zero into 'i' (%rbp-0x60) and jumps down to check whether it is greater than INPUT_NUM-1. Afterwards, it jumps back up and loads the input sample value into %edx based on the i offset. The 16-bit sample is zero-extended on the load, sign-extended to a 32-bit signed integer, and then converted to a 64-bit double precision floating point value in the lower half of the 128-bit xmm0 register! A lot of conversions just to be able to multiply a 16-bit value by a float.

After getting the sample into xmm0, the program loads the volume scaling factor into xmm1 and multiplies the two registers. It then converts the result in xmm0 back to a 32-bit signed integer in %edx, and %dx is finally moved to the input array at the address held in %rax.

That's a lot of extra work to multiply two numbers: to avoid losing precision, the compiler prefers going through all these conversions. We could modify the program so the values being multiplied are compatible enough not to need any conversion, but the volume scaling factor is supposed to be between 0.000 and 1.000, so how can we represent decimal numbers in a 16-bit integer format?

Fixed-Point Numbers

 

Fixed-point numbers are a way to represent fractional numbers in binary. Here's how it works:


In a one-byte representation, by implying a radix point in the middle, we assign 4 bits to represent the whole number part and 4 bits to represent the fractional part. For example, 0110.1010 represents 6 + 10/16 = 6.625.

Naturally, many numbers that could be represented in floating point are lost. We can find the resolution of the fractional part by determining the lowest possible increment. In the example above, it would be 2^-4 = 1/16 = 0.0625.

Fixed-point numbers are described with the Qm.n format notation, where m is the number of bits for the whole number part and n is the number of bits for the fractional part. If there is just one number after the Q, it refers to n. Additionally, unsigned numbers are represented with a U prefixed to the Q. The example above has a Q notation of UQ4.4.
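To make the notation concrete, here's a tiny round-trip through UQ4.4 (a toy example of my own, not from the lab code):

#include <stdint.h>
#include <stdio.h>

int main(void){
    float x = 6.7f;

    // Float-to-Q: multiply by 2^n (n = 4 fractional bits) and truncate
    uint8_t q = (uint8_t)(x * 16);   // 6.7 * 16 = 107.2 -> 107 = 0110.1011

    // Q-to-float: divide by 2^n
    float back = q / 16.0f;          // 107 / 16 = 6.6875

    printf("%.4f stored as UQ4.4 is %u, read back as %.4f\n", x, q, back);
    return 0;
}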



Implementation


Fixed-point number representation is mainly a convention, meaning that the computational implementation is subject to context. There aren't any special calls to tell the computer that some of the bits of a given byte actually represent a fractional part.

In our case, we also need to be able to convert a given volume scaling factor between 0.000 and 1.000 to the closest possible binary value. This can be done with the Float-to-Q conversion: multiplying the float by 2^n and rounding (or truncating) the result to an integer.


Our volume scaling factor only needs a whole-number range of 0 to 1 and doesn't really need to be negative, so it will be a UQ1.15 number, with a resolution of 2^-15 ≈ 3.0518 × 10^-5. Once we multiply it by a sample, the 32-bit result is scaled by 2^15, so we need to perform a bit-shift-right operation n = 15 times.

Here's the changed code:


 float vol = 0.333;
 // converting float into uint16_t: 1 whole, 15 decimals
 uint16_t vol16 = (int)(vol * pow(2,15));

 /////  main process block /////
 gettimeofday(&t1, NULL);

 for (i = 0; i < INPUT_NUM ; i++) {
  input[i] = ((input[i] * vol16) >> 15);

 }
 gettimeofday(&t2, NULL);
 ///////////////////////////////


Results

 

The final test involved running the simple (original) and fixed-point implementations on the x86_64 (xerxes) and Aarch64 (betty) machines. Additionally, I compiled the same sources with the -O3 option. Each of the four versions ran three times on each architecture, and the average time is the final result.


On xerxes, we can notice that the fixed-point implementation actually wasn't an improvement for the default compilation, but it was for the optimized one.

As for betty, there was a significant improvement with the fixed-point implementation, for both compilation options.


Conclusion


The x86_64 seemed not to take well to the fixed-point implementation unless it was told not to worry too much about accuracy and other processes. For example, under -O3 the thread that was tracking the counter and displaying results didn't work at all, because the value at the address of 'i' was no longer being used as the counter by the CPU: it constantly read 5, while the assembly code became much longer and jumped around, somehow making it faster.

On betty, the same thread issue happened, but all implementations were an improvement over the original. I suspect that, because ARM is very common in embedded devices, where floating point operations are not always available or convenient, the fixed-point operations ended up making a huge improvement.