Tuesday, February 28, 2017

Lab 5 - Compiler Optimization with Algorithm Selection

Optimizing code for better performance results


This assignment tests different approaches to writing the same code and compares their performance. The goal is to develop an algorithm that applies a constant factor (e.g. a volume level) to a series of input numbers (e.g. audio samples) as fast as possible.

Ideally, to obtain a more accurate result, I would create three sample files with the Sound Exchange tool (SoX), each with a different frequency (100 Hz, 1000 Hz and 10000 Hz), a sample rate of 44100 Hz, 16-bit signed integers per sample and 2 channels. It takes hundreds of millions of samples to get meaningful results, which would produce three files of about 2 GB each. Since the tests are run on a remote shared machine, I decided to generate the samples in the program itself, using a fixed random seed so that the same set of samples is compared on every run.

The SoX command to generate each file would be as follows:

sox -n -r 44100 -b 16 -c 2 sample.wav synth 11337.868 sin 100


The .WAV format in this case is simply the raw PCM data with header information, which allows the file to be played by other programs. The command generates a wave long enough to be equivalent to 500 million samples per channel: 11337.868 seconds × 44100 samples per second ≈ 500 million.



Simple Volume Control


Volume control just means that each sample gets multiplied by a volume scaling factor. The simplest way to do this is to iterate through the samples and multiply each one by the same number, which for this assignment is between 0.000 and 1.000. The following code is my implementation of this baseline.




Before getting into the code, I should mention that the pthread library is used mainly for display purposes: it keeps track of the counter that iterates through the samples without interfering with the processing. This assignment focuses on testing the performance of a single thread.

The important parts to notice are:
  • 500 million samples of type int16_t are being allocated and assigned a number between -32768 and 32767
  • The main process block iterates through all samples, multiplying each by a constant representing the volume scaling factor
  • Time is being evaluated only in the main process block
The architectures to be tested are x86_64 (xerxes) and AArch64 (betty).

Running the code above on xerxes multiple times, I get results similar to this:


The same code on betty:


We can verify that the set of numbers and the volume scaling factor (0.333) are the same for both architectures. These results can be considered the worst-case scenario for each platform. The following sections explore the options available to optimize the main block's algorithm.


Loop-Unrolling


One simple method of optimization is loop unrolling, applied here to a loop whose iteration count is guaranteed to be a multiple of the unroll factor. The main process block changes as follows:

for (i = 0; i < INPUT_NUM ; ) {    /* assumes INPUT_NUM is even */
    input[i++] *= 0.333f;
    input[i++] *= 0.333f;
}

All it does is perform the multiplication twice per loop iteration, so the conditional statement is evaluated only half as many times. On betty, this gave about an 8% decrease in execution time; on xerxes, about 31%.