Using floating point for the volume scaling factor
From our original example, we were using a floating point value to represent the volume scaling factor and multiplying each 16-bit integer sample by it. From a higher-level language such as C, the way to get the result seems obvious: simply multiply both values and keep the whole number part. Let's see how the computer actually perceives it.
///// main process block /////
gettimeofday(&t1, NULL);
400c51: be 00 00 00 00 mov $0x0,%esi
400c56: bf 90 20 60 00 mov $0x602090,%edi
400c5b: e8 20 fb ff ff callq 400780
for (i = 0; i < INPUT_NUM ; i++) {
400c60: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp)
400c67: 00
400c68: eb 4b jmp 400cb5
input[i] *= 0.5;
400c6a: 48 8b 45 a0 mov -0x60(%rbp),%rax
400c6e: 48 8d 14 00 lea (%rax,%rax,1),%rdx
400c72: 48 8b 45 d0 mov -0x30(%rbp),%rax
400c76: 48 01 d0 add %rdx,%rax
400c79: 48 8b 55 a0 mov -0x60(%rbp),%rdx
400c7d: 48 8d 0c 12 lea (%rdx,%rdx,1),%rcx
400c81: 48 8b 55 d0 mov -0x30(%rbp),%rdx
400c85: 48 01 ca add %rcx,%rdx
400c88: 0f b7 12 movzwl (%rdx),%edx
400c8b: 0f bf d2 movswl %dx,%edx
400c8e: 66 0f ef c0 pxor %xmm0,%xmm0
400c92: f2 0f 2a c2 cvtsi2sd %edx,%xmm0
400c96: f2 0f 10 0d 7a 04 00 movsd 0x47a(%rip),%xmm1 # 401118 <__dso_handle+0x140>
400c9d: 00
400c9e: f2 0f 59 c1 mulsd %xmm1,%xmm0
400ca2: f2 0f 2c d0 cvttsd2si %xmm0,%edx
400ca6: 66 89 10 mov %dx,(%rax)
///// main process block /////
gettimeofday(&t1, NULL);
for (i = 0; i < INPUT_NUM ; i++) {
400ca9: 48 8b 45 a0 mov -0x60(%rbp),%rax
400cad: 48 83 c0 01 add $0x1,%rax
400cb1: 48 89 45 a0 mov %rax,-0x60(%rbp)
400cb5: 48 8b 45 a0 mov -0x60(%rbp),%rax
400cb9: 48 3d ff 64 cd 1d cmp $0x1dcd64ff,%rax
400cbf: 76 a9 jbe 400c6a
//input[i] >>= 1;
input[i] *= 0.5;
}
gettimeofday(&t2, NULL);
400cc1: be 00 00 00 00 mov $0x0,%esi
400cc6: bf b0 20 60 00 mov $0x6020b0,%edi
400ccb: e8 b0 fa ff ff callq 400780
///////////////////////////////
That's a lot of instructions for one simple loop!
When entering the for loop, the CPU first stores zero into 'i' (-0x60(%rbp)) and jumps down to the comparison that checks whether it is greater than INPUT_NUM-1. It then jumps back up and computes the address of the input sample from the i offset. The movzwl instruction loads the 16-bit sample into %edx with the higher-order bits zeroed out, and movswl then sign-extends it to a 32-bit signed integer. Finally, cvtsi2sd converts that integer to a 64-bit double-precision float in the lower half of the 128-bit %xmm0 register! A lot of conversions just to be able to multiply a 16-bit value by a float.
After getting the sample into %xmm0, the program loads the volume scaling factor into %xmm1 and multiplies the two registers. cvttsd2si then converts the result back to a 32-bit signed integer in %edx, and the low 16 bits (%dx) are stored back into the input array at the address held in %rax.
That's a lot of extra work just to multiply two numbers. To avoid losing precision, the compiler prefers going through all of these conversions. We could modify the program so that the values being multiplied are compatible enough to need no conversion at all, but the volume scaling factor is supposed to be between 0.000 and 1.000, so how can we represent decimal numbers in a 16-bit format?
Fixed-Point Numbers
Fixed-point numbers are a way to represent fractional values in binary. Here's how it works:
On this one-byte representation, we can see that, by implying a radix point in the middle, we are assigning 4 bits to represent the whole number part and 4 bits to represent the fractional part.
Naturally, many of the values that floating point could represent are lost. We can find the resolution of the fractional part by working out the smallest possible increment. In the example above, that is 2^-4 = 1/16 = 0.0625.
Fixed-point numbers are described with the Qm.n format notation, where m is the number of bits for the whole number part and n is the number of bits for the fractional part. If there is just one number after the Q, it refers to n. Additionally, unsigned numbers are written with a U prefix before the Q. The example above has a Q notation of UQ4.4.
Implementation
Fixed-point number representation is mainly a convention, meaning that the computational implementation depends on context. There are no special calls to make the computer aware that some bits of a given byte actually represent a fractional part.
In our case, we also need to be able to convert a given volume scaling factor between 0.000 and 1.000 to the closest possible binary value. This can be done with the float-to-Q conversion: multiply the floating-point value by 2^n and truncate (or round) to an integer.
Our volume scaling factor only needs a whole number range of 0 to 1 and doesn't really need to be negative, so it will be a UQ1.15 number, with a resolution of 2^-15 ≈ 3.0518 × 10^-5. Once we multiply it by our sample, the product becomes a 32-bit value, so we need to shift it right by n bits to restore the scale.
Here's the changed code:
float vol = 0.333;
// converting float into uint16_t: 1 whole bit, 15 fractional bits
uint16_t vol16 = (uint16_t)(vol * pow(2, 15));
///// main process block /////
gettimeofday(&t1, NULL);
for (i = 0; i < INPUT_NUM ; i++) {
input[i] = ((input[i] * vol16) >> 15);
}
gettimeofday(&t2, NULL);
///////////////////////////////
Results
The final test involved running the same code for the simple (original) and fixed-point implementations on the x86_64 (xerxes) and AArch64 (betty) architectures. Additionally, I compiled the same sources with the -O3 option. Each of the four versions ran three times on each architecture, and the average time is the final result.
On xerxes, we can see that the fixed-point implementation actually wasn't an improvement under the default compilation, but it was under the optimized one.
As for betty, there was a significant improvement with the fixed-point implementation, for both compilation options.
Conclusion
The x86_64 seemed not to take well to the fixed-point implementation unless it was told not to worry too much about accuracy and other concerns. For example, the thread that was tracking the counter and displaying results didn't work at all, because the value at the address of 'i' was no longer being used as a counter by the CPU and stayed constant at 5, while the assembly code became much longer and jumped around, somehow making it faster.
On betty, the same thread issue happened, but all implementations were an improvement over the original. I suspect that, because ARM is very common in embedded devices, where floating-point operations are not always available or convenient, the fixed-point operations ended up making a huge improvement.