Saturday, March 4, 2017

Lab 5 - Compiler Optimization and Algorithm Selection - Part 2


Using floating point for the volume scaling factor


From our original example, we were using a floating-point value to represent the volume scaling factor and multiplying each 16-bit integer sample by it. From a higher-level language such as C, the approach seems obvious: simply multiply the two values and keep the whole-number part of the result. Let's see how the computer actually perceives it.
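For reference, here is the loop in question as plain C (reconstructed from the interleaved source in the disassembly below; the int16_t element type of input is my assumption):

 for (i = 0; i < INPUT_NUM ; i++) {
  input[i] *= 0.5;
 }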



        /////  main process block /////
        gettimeofday(&t1, NULL);
  400c51:       be 00 00 00 00          mov    $0x0,%esi
  400c56:       bf 90 20 60 00          mov    $0x602090,%edi
  400c5b:       e8 20 fb ff ff          callq  400780 <gettimeofday@plt>

        for (i = 0; i < INPUT_NUM ; i++) {
  400c60:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
  400c67:       00 
  400c68:       eb 4b                   jmp    400cb5 
                input[i] *= 0.5;
  400c6a:       48 8b 45 a0             mov    -0x60(%rbp),%rax
  400c6e:       48 8d 14 00             lea    (%rax,%rax,1),%rdx
  400c72:       48 8b 45 d0             mov    -0x30(%rbp),%rax
  400c76:       48 01 d0                add    %rdx,%rax
  400c79:       48 8b 55 a0             mov    -0x60(%rbp),%rdx
  400c7d:       48 8d 0c 12             lea    (%rdx,%rdx,1),%rcx
  400c81:       48 8b 55 d0             mov    -0x30(%rbp),%rdx
  400c85:       48 01 ca                add    %rcx,%rdx
  400c88:       0f b7 12                movzwl (%rdx),%edx
  400c8b:       0f bf d2                movswl %dx,%edx
  400c8e:       66 0f ef c0             pxor   %xmm0,%xmm0
  400c92:       f2 0f 2a c2             cvtsi2sd %edx,%xmm0
  400c96:       f2 0f 10 0d 7a 04 00    movsd  0x47a(%rip),%xmm1        # 401118 <__dso_handle+0x140>
  400c9d:       00
  400c9e:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
  400ca2:       f2 0f 2c d0             cvttsd2si %xmm0,%edx
  400ca6:       66 89 10                mov    %dx,(%rax)

        /////  main process block /////
        gettimeofday(&t1, NULL);

        for (i = 0; i < INPUT_NUM ; i++) {
  400ca9:       48 8b 45 a0             mov    -0x60(%rbp),%rax
  400cad:       48 83 c0 01             add    $0x1,%rax
  400cb1:       48 89 45 a0             mov    %rax,-0x60(%rbp)
  400cb5:       48 8b 45 a0             mov    -0x60(%rbp),%rax
  400cb9:       48 3d ff 64 cd 1d       cmp    $0x1dcd64ff,%rax
  400cbf:       76 a9                   jbe    400c6a

                //input[i] >>= 1;
                input[i] *= 0.5;
        }
        gettimeofday(&t2, NULL);
  400cc1:       be 00 00 00 00          mov    $0x0,%esi
  400cc6:       bf b0 20 60 00          mov    $0x6020b0,%edi
  400ccb:       e8 b0 fa ff ff          callq  400780 <gettimeofday@plt>
        ///////////////////////////////

That's a lot of instructions for one simple loop!

When entering the for loop, the CPU first loads zero into 'i' (%rbp-0x60) and jumps down to check whether it is greater than INPUT_NUM-1. It then jumps back up, computes the sample's address from the i offset, loads the 16-bit sample with its higher-order bits zeroed out, and sign-extends it to a 32-bit signed integer. Then it does yet another conversion to a 64-bit double-precision floating-point value, loading it into the lower 64 bits of the 128-bit xmm0 register! That is a lot of conversions just to be able to multiply a 16-bit value by a float.

After getting the sample into xmm0, the program loads the volume scaling factor into xmm1 and multiplies the two registers. It then converts the result in xmm0 to a 32-bit signed integer in %edx, and the lower half, %dx, is stored back into the input array at the address held in %rax.
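In C-like terms, the work done for each sample is roughly the following (a sketch of the steps described above, not actual compiler output):

 int16_t sample = input[i];        // movzwl/movswl: load and sign-extend
 int32_t wide   = (int32_t)sample; // 16-bit -> 32-bit signed integer
 double  d      = (double)wide;    // cvtsi2sd: integer -> double in xmm0
 d = d * 0.5;                      // mulsd: multiply by the factor in xmm1
 int32_t result = (int32_t)d;      // cvttsd2si: truncate back to integer
 input[i] = (int16_t)result;       // mov %dx,(%rax): store the low 16 bits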

That's a lot of extra work just to multiply two numbers. To avoid losing precision, the compiler prefers going through all of these conversions. We could modify the program so that the values being multiplied are compatible enough to need no conversion, but the volume scaling factor is supposed to be between 0.000 and 1.000, so how can we represent decimal numbers in a 16-bit format?

Fixed-Point Numbers

 

Fixed-point numbers are a way to represent fractional numbers in binary. Here's how it works:


In this one-byte representation, we can see that, by implying a radix point in the middle, we assign 4 bits to represent the whole-number part and 4 bits to represent the fractional part.

Naturally, far fewer values can be represented this way than with floating point. We can find the resolution of the fractional part by working out the lowest possible increment. In the example above, that is 2⁻⁴ = 1/16 = 0.0625.
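As a concrete illustration, here is a small sketch that encodes and decodes a value in this one-byte format (the helper names are mine):

 #include <stdint.h>
 #include <stdio.h>

 // 4 whole-number bits, 4 fractional bits, implied radix point in the middle.
 static uint8_t to_fixed(float x)    { return (uint8_t)(x * 16.0f); }
 static float   from_fixed(uint8_t q) { return q / 16.0f; }

 int main(void) {
  uint8_t q = to_fixed(5.625f);             // 0101.1010 in binary = 0x5a
  printf("%#x -> %f\n", q, from_fixed(q));  // prints: 0x5a -> 5.625000
  return 0;
 }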

Fixed-point numbers are described with the Qm.n format notation, where m is the number of bits for the whole-number part and n is the number of bits for the fractional part. If there is just one number after the Q, it refers to n. Additionally, unsigned numbers are represented with a U prefixed to the Q. The example above has a Q notation of UQ4.4.



Implementation


Fixed-point number representation is mainly a convention, meaning that the computational implementation depends on context. There aren't any special calls to make the computer aware that a certain number of bits within a byte actually represent a fractional part.

In our case, we also need to convert a given volume scaling factor between 0.000 and 1.000 to the closest possible binary value. This can be done with the Float-to-Q conversion:
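In essence, converting a float x to a Qm.n value means multiplying it by 2ⁿ and rounding (or truncating) to an integer, and converting back means dividing by 2ⁿ:

 q = round(x × 2ⁿ)

This is exactly what the pow(2,15) multiplication in the code below does.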


Our volume scaling factor only needs a whole-number range of 0 to 1 and doesn't need to be negative, so it will be a UQ1.15 number, with a resolution of 2⁻¹⁵ ≈ 3.0518 × 10⁻⁵. Once we multiply it by a sample, the 32-bit product is scaled up by 2¹⁵, so we need to shift it right n (15) times to bring it back to the integer sample scale.

Here's the changed code:


 float vol = 0.333;
 // converting float into uint16_t: 1 integer bit, 15 fractional bits (UQ1.15)
 uint16_t vol16 = (uint16_t)(vol * pow(2,15));

 /////  main process block /////
 gettimeofday(&t1, NULL);

 for (i = 0; i < INPUT_NUM ; i++) {
  input[i] = ((input[i] * vol16) >> 15); // multiply, then shift right to drop the fractional bits

 }
 gettimeofday(&t2, NULL);
 ///////////////////////////////
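A quick way to sanity-check the fixed-point math is to compare it against the floating-point result for a single sample (a standalone sketch with an arbitrary sample value; compile with -lm for pow):

 #include <math.h>
 #include <stdint.h>
 #include <stdio.h>

 int main(void) {
  float vol = 0.333;
  uint16_t vol16 = (uint16_t)(vol * pow(2,15)); // UQ1.15 factor
  int16_t sample = 12345;

  int16_t fixed = (int16_t)((sample * vol16) >> 15); // fixed-point
  int16_t fp    = (int16_t)(sample * vol);           // floating-point

  // The two results should match to within +/-1 from rounding.
  printf("fixed: %d  float: %d\n", fixed, fp);
  return 0;
 }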


Results

 

The final test involved running the same code for the simple (original) and fixed-point implementations on the x86_64 (xerxes) and AArch64 (betty) architectures. Additionally, I compiled the same sources with the -O3 option. Each of the four versions ran three times on each architecture, and the average time is the final result.


On xerxes, we can notice that the fixed-point implementation actually wasn't an improvement for the default compilation, but it was for the optimized one.

As for betty, there was a significant improvement with the fixed-point implementation for both compilation options.


Conclusion


The x86_64 did not seem to take well to the fixed-point implementation unless it was told not to worry too much about accuracy and other concerns (i.e., with -O3). For example, the thread that was tracking the counter and displaying results didn't work at all: the value at the address of 'i' was no longer being used as the loop counter by the CPU and sat constantly at 5, while the assembly code became much longer and jumped around, somehow making it faster.

On betty, the same thread issue happened, but all implementations were an improvement over the original. I suspect that, since ARM is very common among embedded devices, where floating-point operations are not always available or convenient, the fixed-point operations ended up making a huge improvement.
