To demonstrate the implementation of inline assembly, I'll be optimizing the code from lab 5, where 500 million 16-bit sound samples were scaled by a volume factor. The final solution I came up with was the following:
float vol = 0.333;
// convert float to UQ1.15 fixed point: 1 integer bit, 15 fractional bits
uint16_t vol16 = (int)(vol * pow(2,15));
///// main process block /////
gettimeofday(&t1, NULL);
for (i = 0; i < INPUT_NUM ; i++)
input[i] = ((input[i] * vol16) >> 15);
gettimeofday(&t2, NULL);
///////////////////////////////
This solution converts the float volume scaling factor into a fixed-point format and then performs an integer multiplication. Compiling without any options, processing 500 million samples took very close to 4 seconds (3959 ms); with the -O3 option, the time was cut down to 362 ms.
Implementing inline assembly
I decided to create a more general function to introduce the inline assembly into the code. The code block above turned into this:
float vol = 0.333;
///// main process block /////
gettimeofday(&t1, NULL);
adjustVolume(input, vol, INPUT_NUM);
gettimeofday(&t2, NULL);
///////////////////////////////
And the final function is as follows:
void adjustVolume(int16_t* samples, float vsf, unsigned long size){
unsigned long nli;
int samples_left;
int16_t* i;
//convert float to UQ1.15 fixed point
uint16_t vol16 = (int)(vsf * pow(2,15));
//see if there are leftover values
samples_left = size % 8;//8 samples at a time
//loop until just before the first non-loadable index
nli = size - samples_left;
int16_t* p_nli = &samples[nli];
for(i = samples; i < p_nli ; i+=8){ //8 samples per iteration
__asm__("DUP v1.8h, %w[volume]\n\t" //replicate volume into all 8 lanes of v1
"LD1 {v0.8h}, [%[smp_ptr]]\n\t" //load 8 16-bit samples into v0
"SQDMULH v0.8h, v0.8h, v1.8h\n\t" //adjust 8 samples at a time; save in v0
"ST1 {v0.8h}, [%[smp_ptr]]\n\t" //store back into the samples array
:
: [volume] "r" (vol16), [smp_ptr] "r" (i)//input
: "memory", "v0", "v1" //the asm writes memory and uses v0/v1
);
}
while(samples_left){
samples[size-samples_left] = ((samples[size-samples_left] * vol16) >> 15);
samples_left--;
}
}
The function starts by converting the volume scaling factor from float into fixed point, which makes loading the value more convenient. Getting into the assembly, we pass in the converted 16-bit volume and the pointer into the array, which advances by eight 16-bit elements each iteration. DUP and LD1 load the volume and the samples into separate vector registers. SQDMULH multiplies the two vectors lane by lane, producing a doubled 32-bit product per lane, and overwrites the samples vector with the high half of each result (equivalent to the >> 15 in the C version). ST1 then stores the adjusted samples back into the array in memory.
To have a better idea of what is going on, here's a visualization using gdb:
Commands:
- ni: next instruction
- x/h $[register]: show contents in memory (x) pointed by value in register, format as half-word (h)
- p $[register]: print value in register. In the case above, only the half-word format is shown
Wrapping the assembly is a loop controlled by an int16_t pointer that advances 8 elements per iteration and exits when there aren't 8 elements left to load. The function checks for that case beforehand, to minimize comparisons inside the loop. The worst case is size % 8 = 7, and those last samples are individually adjusted after the main block.
Results!
Finally, this section compares the results from lab 5's implementation, simply using fixed point on 500 million samples, to this lab's implementation, using inline assembly on 500 000 007 samples, just to get the worst possible case (although those last samples make almost no difference to the result). There is a huge improvement using inline assembly with no compiler options, although no improvement over the optimized compilation. This is probably due to the extra overhead in the function, and maybe those extra samples made a dent after all, but that was to be expected, since -O3 enables auto-vectorization. Still, compiling the inline assembly code with -O1 already gives results very close to the -O3 option (372 ms), while the older version at -O1 comes in around 760 ms.
Conclusion
Using inline assembly allows us to make bare-metal optimizations for the parts where performance really matters. For single operations there's no real need for assembly, so we can enjoy the abstraction C provides for most tasks and hand the heavy lifting to assembly only when necessary. Another benefit is that we can optimize code without submitting all of it to compiler options, which can produce unexpected results (especially when multi-threading).