Thursday, March 30, 2017

Lab 7 - Inline A64 Assembler

This lab will explore the benefits of using inline assembly, specifically on AArch64. If writing assembly were as easy as writing C (and for some people it is), C would be obsolete. Thankfully, C provides abstraction so we don't have to deal with processor-specific instructions, and it makes code portable between processors. But there are times when the compiler just can't make full use of processor features that could significantly improve the performance of a certain process, at least not without a lot of tinkering to make it understand that it should use those features. The most straightforward way to use those features, while keeping the abstraction of C, is inline assembly, which embeds architecture-specific assembly code within the C source code.

To demonstrate the implementation of inline assembly, I'll be optimizing the code from lab 5, where 500 million 16-bit sound samples were modified by a volume scaling factor. The final solution I had come up with was the following:


 float vol = 0.333;  
   
 // converting float into uint16_t: 1 whole, 15 decimals  
 uint16_t vol16 = (int)(vol * pow(2,15));  
   
 ///// main process block /////  
 gettimeofday(&t1, NULL);  
   
 for (i = 0; i < INPUT_NUM ; i++)   
   input[i] = ((input[i] * vol16) >> 15);  
   
 gettimeofday(&t2, NULL);  
 ///////////////////////////////  

This solution involves converting the float volume scaling factor into a fixed-point format and then performing the multiplication. The run time when compiling without any options, for 500 million samples, was very close to 4 seconds (3959 ms); with the -O3 option, the time was cut down to 362 ms.

Implementing inline assembly

I decided to create a more general function to introduce the inline assembly into the code. The code block above turned into this:

 float vol = 0.333;  
   
 ///// main process block /////  
 gettimeofday(&t1, NULL);  
   
 adjustVolume(input, vol, INPUT_NUM);  
   
 gettimeofday(&t2, NULL);  
 ///////////////////////////////  

And the final function is as follows:

 void adjustVolume(int16_t* samples, float vsf, unsigned long size){  
   
   unsigned long nli;  
   int samples_left;  
   int16_t* i;  
   
   //convert float to UQ1.15 fixed point  
   uint16_t vol16 = (int)(vsf * pow(2,15));  
   
   //see if there's left over values  
   samples_left = size % 8;//8 samples at a time   
   
   //loop until before non loadable index  
   nli = size - samples_left;  
   int16_t* p_nli = &samples[nli];  
   
   for(i = samples; i < p_nli ; i+=8){ //8 samples   
   
     __asm__("DUP v1.8h, %w[volume]\n\t" //broadcast volume into v1  
         "LD1 {v0.8h}, [%[smp_ptr]]\n\t" //load 8 16-bit samples into v0  
         "SQDMULH v0.8h, v0.8h, v1.8h\n\t" //adjust 8 samples at a time; save in v0  
         "ST1 {v0.8h}, [%[smp_ptr]]\n\t" //store back into samples array  
   
         : //no outputs  
   
         : [volume] "r" (vol16), [smp_ptr] "r" (i) //inputs  
   
         : "v0", "v1", "memory" //clobbers: vector registers we overwrite, plus memory  
   
         );  
   
   }  
   
   while(samples_left){  
   
     //leftover samples run from index size-samples_left up to size-1  
     samples[size-samples_left] = ((samples[size-samples_left] * vol16) >> 15);  
   
     samples_left--;  
   
   }  
   
 }  

The function starts by converting the volume scaling factor float into fixed point, to make loading the value more convenient. Getting into the assembly, we pass in the converted 16-bit volume and the pointer into the array, which advances by 8 16-bit elements each iteration. DUP and LD1 load the volume and the samples into different vector registers. SQDMULH multiplies those vectors lane by lane, which produces a doubled 32-bit product per lane, and overwrites the samples vector with the high half of each result. ST1 then stores the adjusted samples back into the array in memory.

To have a better idea of what is going on, here's a visualization using gdb:


Commands:
  • ni: next instruction
  • x/h $[register]: examine (x) the memory pointed to by the value in the register, formatted as half-words (h)
  • p $[register]: print the value in a register. In the case above, only the half-word format is shown
At the top, we can see that the inline assembly remained intact through compilation. In the commands, we can see a value of the array in memory and then the same value printed as a signed integer in vector register v0. v1 holds the fixed-point volume scaling factor (10911), which is equivalent to 0.333 (or close to it, at least). After executing the multiplication, we see that v0 now holds the adjusted values, and one instruction further, those are stored in memory.

Wrapping the assembly is a loop controlled by an int16_t pointer that advances 8 elements per iteration and exits when there aren't 8 elements left to load. The function checks for leftover samples beforehand, to minimize comparisons inside the loop. The worst case is size % 8 == 7, and those last samples are individually adjusted after the main process.

Results!

Finally, this section compares the results from lab 5's implementation, simply using fixed point on 500 million samples, to this lab's implementation, using inline assembly on 500 000 007 samples, just to hit the worst possible case (although those last samples make almost no difference to the result).


So, a huge improvement using inline assembly with no compiler options, although no improvement on the optimized compilation. This is probably due to the extra overhead of the function call, and maybe those extra samples made a dent after all; but that was to be expected, since -O3 enables auto-vectorization. Still, compiling the inline assembly code with just -O1 already gives results very close to the -O3 option (372 ms), while the older version at -O1 gives results around 760 ms.

Conclusion

Using inline assembly allows us to make bare-metal optimizations for the parts where performance really matters. For single operations, there's no real need to use assembly, so we can enjoy the abstraction C provides for most tasks, and hand the heavy lifting to assembly when necessary.

Another benefit is that we can optimize code without handing all of it over to compiler optimizations, which can produce unexpected results (especially when multi-threading).
