Using floating point for the volume scaling factor
From our original example, we were using a floating point value to represent the volume scaling factor and multiplying each 16-bit integer sample by it. From a higher-level language such as C, the way to get the result seems obvious: simply multiply both values and keep the whole number part. Let's see how the computer actually perceives it.
///// main process block /////
gettimeofday(&t1, NULL);
400c51: be 00 00 00 00 mov $0x0,%esi
400c56: bf 90 20 60 00 mov $0x602090,%edi
400c5b: e8 20 fb ff ff callq 400780
for (i = 0; i < INPUT_NUM ; i++) {
400c60: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp)
400c67: 00
400c68: eb 4b jmp 400cb5
input[i] *= 0.5;
400c6a: 48 8b 45 a0 mov -0x60(%rbp),%rax
400c6e: 48 8d 14 00 lea (%rax,%rax,1),%rdx
400c72: 48 8b 45 d0 mov -0x30(%rbp),%rax
400c76: 48 01 d0 add %rdx,%rax
400c79: 48 8b 55 a0 mov -0x60(%rbp),%rdx
400c7d: 48 8d 0c 12 lea (%rdx,%rdx,1),%rcx
400c81: 48 8b 55 d0 mov -0x30(%rbp),%rdx
400c85: 48 01 ca add %rcx,%rdx
400c88: 0f b7 12 movzwl (%rdx),%edx
400c8b: 0f bf d2 movswl %dx,%edx
400c8e: 66 0f ef c0 pxor %xmm0,%xmm0
400c92: f2 0f 2a c2 cvtsi2sd %edx,%xmm0
400c96: f2 0f 10 0d 7a 04 00 movsd 0x47a(%rip),%xmm1 # 401118 <__dso_handle+0x140>
400c9d: 00
400c9e: f2 0f 59 c1 mulsd %xmm1,%xmm0
400ca2: f2 0f 2c d0 cvttsd2si %xmm0,%edx
400ca6: 66 89 10 mov %dx,(%rax)
///// main process block /////
gettimeofday(&t1, NULL);
for (i = 0; i < INPUT_NUM ; i++) {
400ca9: 48 8b 45 a0 mov -0x60(%rbp),%rax
400cad: 48 83 c0 01 add $0x1,%rax
400cb1: 48 89 45 a0 mov %rax,-0x60(%rbp)
400cb5: 48 8b 45 a0 mov -0x60(%rbp),%rax
400cb9: 48 3d ff 64 cd 1d cmp $0x1dcd64ff,%rax
400cbf: 76 a9 jbe 400c6a
//input[i] >>= 1;
input[i] *= 0.5;
}
gettimeofday(&t2, NULL);
400cc1: be 00 00 00 00 mov $0x0,%esi
400cc6: bf b0 20 60 00 mov $0x6020b0,%edi
400ccb: e8 b0 fa ff ff callq 400780
///////////////////////////////
That's a lot of instructions for one simple loop!
When entering the for loop, the CPU first stores zero into 'i' (-0x60(%rbp)) and jumps down to the comparison that checks whether it is greater than INPUT_NUM-1. It then jumps back up and computes the address of the input sample from the i offset. The movzwl instruction loads the 16-bit sample into %edx with the higher-order bits zeroed out, and movswl then sign-extends it to a 32-bit signed integer. Finally, cvtsi2sd converts that integer to a 64-bit double-precision float in the lower half of the 128-bit %xmm0 register! A lot of conversions just to be able to multiply a 16-bit value by a float.
After getting the sample into %xmm0, the program loads the volume scaling factor into %xmm1 and multiplies the two registers. cvttsd2si then converts the result back to a 32-bit signed integer in %edx, and the low 16 bits (%dx) are stored back into the input array at the address held in %rax.
That's a lot of extra work just to multiply two numbers. To avoid losing precision, the compiler prefers going through all of these conversions. We could modify the program so that the values being multiplied are compatible enough to need no conversion at all, but the volume scaling factor is supposed to be between 0.000 and 1.000, so how can we represent decimal numbers in a 16-bit format?
Fixed-Point Numbers
Fixed-point numbers are a way to represent fractional values in binary. Here's how it works:
On this one-byte representation, we can see that, by implying a radix point in the middle, we are assigning 4 bits to represent the whole number part and 4 bits to represent the fractional part.
Naturally, many of the values that floating point could represent are lost. We can find the resolution of the fractional part by working out the smallest possible increment. In the example above, that is 2^-4 = 1/16 = 0.0625.
Fixed-point numbers are described with the Qm.n format notation, where m is the number of bits for the whole number part and n is the number of bits for the fractional part. If there is just one number after the Q, it refers to n. Additionally, unsigned numbers are written with a U prefix before the Q. The example above has a Q notation of UQ4.4.
Implementation
Fixed-point number representation is mainly a convention, meaning that the computational implementation depends on context. There are no special calls to make the computer aware that some bits of a given byte actually represent a fractional part.
In our case, we also need to be able to convert a given volume scaling factor between 0.000 and 1.000 to the closest possible binary value. This can be done with the float-to-Q conversion: multiply the floating-point value by 2^n and truncate (or round) to an integer.
Our volume scaling factor only needs a whole number range of 0 to 1 and doesn't really need to be negative, so it will be a UQ1.15 number, with a resolution of 2^-15 ≈ 3.0518 × 10^-5. Once we multiply it by our sample, the product becomes a 32-bit value, so we need to shift it right by n bits to restore the scale.
Here's the changed code:
float vol = 0.333;
// converting float into uint16_t: 1 whole bit, 15 fractional bits
uint16_t vol16 = (uint16_t)(vol * pow(2, 15));
///// main process block /////
gettimeofday(&t1, NULL);
for (i = 0; i < INPUT_NUM ; i++) {
input[i] = ((input[i] * vol16) >> 15);
}
gettimeofday(&t2, NULL);
///////////////////////////////
Results
The final test involved running the same code for the simple (original) and fixed-point implementations on the x86_64 (xerxes) and AArch64 (betty) architectures. Additionally, I compiled the same sources with the -O3 option. Each of the four versions ran three times on each architecture, and the average time is the final result.
On xerxes, we can see that the fixed-point implementation actually wasn't an improvement under the default compilation, but it was under the optimized one.
As for betty, there was a significant improvement with the fixed-point implementation, for both compilation options.
Conclusion
The x86_64 seemed not to take well to the fixed-point implementation unless it was told not to worry too much about accuracy and other concerns. For example, the thread that was tracking the counter and displaying results didn't work at all, because the value at the address of 'i' was no longer being used as a counter by the CPU and stayed constant at 5, while the assembly code became much longer and jumped around, somehow making it faster.
On betty, the same thread issue happened, but all implementations were an improvement over the original. I suspect that, because ARM is very common in embedded devices, where floating-point operations are not always available or convenient, the fixed-point operations ended up making a huge improvement.