Saturday, April 22, 2017

Project - Optimizing wmemchr

As a continuation of the previous post, I figured out a way to optimize the wmemchr function in glibc. I'll go through the process of figuring it all out, though unfortunately the final optimization didn't perform well enough to be worth submitting.

First Steps

The first operation I needed to worry about was getting the values into a vector register to compare with the given character. The 'CMEQ' instruction was pointed out to me: it compares two vectors element by element, setting each lane of the destination vector to all 1s if the element pair matches, or all 0s if not. Since SIMD instructions don't set condition flags, these per-lane results serve that purpose for each vector element.

I used this instruction to compare four pairs of 32-bit elements. One vector holds the given wide character replicated across its four lanes, and the other holds four elements from the wide-character array. The results are saved in a third vector.

Handling Vectors

Identifying which element matched was the trickiest part. It took three instructions to load the values and compare them, but I couldn't find an efficient way, with the SIMD instructions available, to know which vector element matched. Instead, I opted to simply determine whether a 4-element set contains a match at all, without knowing which element it is. That way, I don't spend instructions analyzing each element individually and can move straight on to the next set.

The way I did this was with an instruction called UMAXV. It finds the element with the highest value in a vector: 'U' is for unsigned, and 'V' indicates a vector reduction instruction (one that operates on a single vector register). If there is no match, the highest value is zero; if there is a match, the highest value is four bytes of 1s. I pass this maximum value out of the inline assembly and check whether it is non-zero, in which case I go through the elements individually to find the one that matches. Here's the final function:



wchar_t *
mynew__wmemchr (const wchar_t *s, wchar_t c, size_t n)
{
    while( n>=4 ){

        unsigned int set;

        __asm__ __volatile__
                ("DUP v1.4s, %w[c]\n\t"   //replicate c into the four lanes of v1
                "LD1 {v0.4s}, [%[s]]\n\t" //load four wide characters

                //if a lane is all 1s, it matches the character
                "CMEQ v2.4s, v0.4s, v1.4s\n\t"

                //"check" if there were any matches
                "UMAXV S3, v2.4s\n\t"
                //move v3.4s[0] to set
                "UMOV %w[set], v3.4s[0]\n\t"

                :[set] "=r" (set) //output
                :[s] "r" (s), [c] "r" (c) //input
                : "v0", "v1", "v2", "v3", "memory"); //clobbered registers and memory

        if (set){
            //found a match in this set of four
            if (s[0] == c)
                return (wchar_t *) s;
            if (s[1] == c)
                return (wchar_t *) &s[1];
            if (s[2] == c)
                return (wchar_t *) &s[2];
            if (s[3] == c)
                return (wchar_t *) &s[3];
        }

        s += 4;
        n -= 4;
    }

    if (n > 0)
      {
        if (*s == c)
          return (wchar_t *) s;
        ++s;
        --n;
      }
    if (n > 0)
      {
        if (*s == c)
          return (wchar_t *) s;
        ++s;
        --n;
      }
    if (n > 0)
      if (*s == c)
        return (wchar_t *) s;

    return NULL;
}



The UMAXV instruction does not accept 'v3.4s' as its first operand: since it reduces the whole vector to a single value, the destination must be a scalar register, and 'S3' is just the 32-bit scalar view of the same v3 register.

Comparing against the original function, I ended up adding work to the loop body. The new version makes fewer scalar comparisons, but the end results weren't as good as I expected, probably because UMAXV itself performs several comparisons internally to determine the highest number in the vector.
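For reference, the same four-lane check can be expressed with NEON intrinsics from arm_neon.h instead of inline assembly. Here's a sketch that maps to the same DUP/LD1/CMEQ/UMAXV sequence (not the code I benchmarked):

#include <arm_neon.h>
#include <wchar.h>

/* Returns nonzero if any of the four 32-bit elements at s equals c. */
static inline unsigned int
match4 (const wchar_t *s, wchar_t c)
{
    uint32x4_t data = vld1q_u32((const uint32_t *)s); /* LD1   */
    uint32x4_t key  = vdupq_n_u32((uint32_t)c);       /* DUP   */
    uint32x4_t eq   = vceqq_u32(data, key);           /* CMEQ  */
    return vmaxvq_u32(eq);                            /* UMAXV */
}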

Testing

To test the performance of the functions, I created a test file that compares the execution time of the new function against the original one, compiled via a makefile into executables with different optimization levels. The test allocates a wide-character array of 1 billion elements, where the very last character is the one being searched for. The test also includes the time taken using the glibc 2.17 shared system library, which comes in at a consistent 920 milliseconds. These were the results:



Conclusion

The final results showed a reasonable improvement when compiling with no optimization options and with 'O1', but since the system library is optimized with 'O2' or 'O3', the improvement isn't enough to justify submitting a pull request, especially since it only targets one platform.

SIMD instructions are great for processing bulk data, but I found that narrowing their results back down to a scalar can get quite tricky.

Friday, April 7, 2017

SPO600 Project - wmemchr

I decided to try improving the wmemchr function in the glibc library. The function searches for a wide character (c) in a wide-character array (s). It is similar to wcschr, except that it takes the array size (n).


#include <wchar.h>

       wchar_t *wmemchr(const wchar_t *s, wchar_t c, size_t n);

This works exactly like the memchr function, but for wide characters. A wide character in C has the type wchar_t, which is a 4-byte value (in glibc).

The existing function is fairly simple:

wchar_t *
wmemchr (const wchar_t *s, wchar_t c, size_t n)
{
  /* For performance reasons unfold the loop four times.  */
  while (n >= 4)
    {
      if (s[0] == c)
        return (wchar_t *) s;
      if (s[1] == c)
        return (wchar_t *) &s[1];
      if (s[2] == c)
        return (wchar_t *) &s[2];
      if (s[3] == c)
        return (wchar_t *) &s[3];
      s += 4;
      n -= 4;
    }

  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    if (*s == c)
      return (wchar_t *) s;

  return NULL;
}

It first minimizes loop overhead through loop unrolling, checking 4 values per iteration. Then, if the wide character wasn't found, it looks for it among the last remaining samples, if any. Otherwise, it returns NULL, indicating the character wasn't found.

Setting up

To be able to test improvements, I created a test file that generates an array of 1 billion wchar_t characters by allocating it on the heap. The very last character is different from all the previous ones, and that's the one we'll be looking for.
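A minimal sketch of such a test harness, timing the system's wmemchr (names and the search characters are illustrative):

#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <sys/time.h>

#define NUM 1000000000UL

int main(void)
{
    struct timeval t1, t2;

    wchar_t *arr = malloc(NUM * sizeof(wchar_t));
    if (!arr) return 1;

    for (size_t i = 0; i < NUM; i++)
        arr[i] = L'a';
    arr[NUM - 1] = L'z';          /* the character we'll be looking for */

    gettimeofday(&t1, NULL);
    wchar_t *found = wmemchr(arr, L'z', NUM);
    gettimeofday(&t2, NULL);

    long ms = (t2.tv_sec - t1.tv_sec) * 1000 +
              (t2.tv_usec - t1.tv_usec) / 1000;
    printf("found at index %ld in %ld ms\n",
           found ? (long)(found - arr) : -1L, ms);

    free(arr);
    return 0;
}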

I copied the function into the file and gave it a different name. Then I performed two speed tests to look for the character: the first uses the shared object on the system, and the second uses the function I copied from glibc. I did this to get an idea of which optimization options were used for that function when the system library was compiled. Here are the tests made on an AArch64 system (betty):



The table compares execution times of the same function built with different compilation options against the system library. O2 and O3 took basically the same time as the system, so we can assume the C library was compiled with at least O2.

Furthermore, neither the O3 executable nor the system library uses vectorization in wmemchr. When I run 'objdump -d /lib64/libc.so.6 | less' and look at the assembly code for the function, I see no use of vector registers, but neither does the O3 executable from my tests. This means I should assume at least O2 was used, but not necessarily O3.
 
 

Next Steps

Next, I'll try to introduce inline assembly into the code, using vectorization. The idea is to replace the contents of the first while loop, since most of the processing happens there, and figure out a way to process four 32-bit values at once, using SIMD instructions from the 128-bit NEON extension found in the ARM Cortex-A series.

The tricky part is not comparing the 4 values, but actually having the assembly return which of the 4 matches without making too many comparisons, which would hurt performance. The approach I'm considering is to quickly figure out whether a set of values contains the one I'm looking for. If it doesn't, just fetch the next set to avoid wasting precious time. If it does, then it can take as many instructions as necessary, because the value is already close at hand.

Tuesday, April 4, 2017

qsort - The useful but hard to master glibc function

 Understanding qsort


For my project I decided to try optimizing the qsort function in glibc. The function sorts an array of nmemb elements, each of size size. The interesting feature of the function is that it can sort an array of any type: one of its arguments is a pointer to a comparison function, which the programmer supplies. This allows sorting integers, strings, or even custom structs. Here's the prototype:

 #include <stdlib.h>  
   
     void qsort(void *base, size_t nmemb, size_t size,  
          int (*compar)(const void *, const void *));  
   
     void qsort_r(void *base, size_t nmemb, size_t size,  
          int (*compar)(const void *, const void *, void *),  
          void *arg);  


The base argument receives the array, and the comparison function compar receives two void pointers to the elements being compared and returns an integer: negative if the first element should sort before the second, and positive if it should sort after. If the integer returned is zero, the relative order of the two elements in the sorted array is unspecified. For example, comparing doubles:

 int  
 compare_doubles (const void *a, const void *b)  
 {  
  const double *da = (const double *) a;  
  const double *db = (const double *) b;  
  return (*da > *db) - (*da < *db);  
 }  

Another example would be using an existing function, like strcmp. Although strcmp doesn't take void pointer arguments, it can be wrapped inside another function. This example is from the qsort man page:


 #include <string.h>  
   
     static int  
     cmpstringp(const void *p1, const void *p2)  
     {  
       /* The actual arguments to this function are "pointers to  
        pointers to char", but strcmp(3) arguments are "pointers  
        to char", hence the following cast plus dereference */  
   
       return strcmp(* (char * const *) p1, * (char * const *) p2);  
     }  

For a given array of doubles called arr, of size NUM_ELEM, the qsort function would be called like so:

   qsort( arr, NUM_ELEM, sizeof(double), compare_doubles);  

I haven't yet mentioned the qsort_r alternative. This implementation, derived from BSD, makes qsort reentrant-safe. The additional argument is used to pass mutable state to the comparison function (instead of, say, a global variable), and it implies that your comparison function must also be reentrant-safe.
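To illustrate, here's a sketch of passing state through qsort_r's extra argument instead of a global (glibc's variant; the array name and size are placeholders):

#define _GNU_SOURCE   /* glibc exposes qsort_r with this defined */
#include <stdlib.h>

/* Comparison function whose sort direction comes from the extra
   argument rather than from a global variable. */
static int
cmp_ints (const void *a, const void *b, void *arg)
{
  int ascending = *(int *) arg;
  int x = *(const int *) a;
  int y = *(const int *) b;
  return ascending ? (x > y) - (x < y)
                   : (y > x) - (y < x);
}

/* ...given an int array arr of NUM_ELEM elements: */
int ascending = 1;
qsort_r (arr, NUM_ELEM, sizeof (int), cmp_ints, &ascending);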


Under the hood


As with most things in life, the deeper you go, the uglier it gets. To understand how qsort is being called, we need to dig into the library. The functions qsort and qsort_r are defined in a file called msort.c (why?) in the standard library.

The function qsort simply calls qsort_r with NULL as the last argument. qsort_r starts by checking sizes, determining whether a temporary array should be allocated on the stack or the heap, whether there is enough memory to allocate on the heap, and other checks that determine how and where the given array will be processed. For example, if the memory requirements are too high, sorting is done on the array itself instead of on a temporary copy.

From what I understood, there are two possible functions that can be called to sort the given array, depending on its size and the available memory on the computer. The two functions are __quicksort and msort_with_tmp. Neither is documented at all, and only the first has comments in the source. The comments in qsort_r imply that __quicksort is slower and is only used when a temporary array couldn't be allocated in memory, as it "doesn't need a temporary array". In every other case, msort_with_tmp is used to indirectly sort the array.

Optimiz(ing|able\?)


After going through all that horrible-looking code, I started to extract the pieces I thought necessary to do some benchmarking and profiling. There are so many moving pieces that it became really hard to set up a development environment to work in. With no success there, my only other option would be to work on the glibc code itself, but that has two major disadvantages: I wouldn't be able to profile the different sections of the code, and every change would require a complete glibc compilation. So aside from getting only vague performance results, each change to the code would take 10 minutes or more to test.

Now, even assuming I had none of the previous problems and could have set up a development environment, there is the question of whether optimization is even possible. Sorting is a deeply studied subject in computer science, and even so, studying it isn't as complicated as actually implementing it: supporting any data type or structure, determining at runtime which sorting algorithm is best for a given situation, being reentrant-safe and standards-compliant, and performing as fast as possible, all while staying portable across architectures. It is a lot to take on, and some have. From the comments in the source, we can see that much of the code was referenced from books and other sources whose authors studied the subject extensively, so the code might already be as optimized as it can get.

So why did I choose qsort? My original idea was to profile the "function" and determine which step it spends the most time on, then focus on that part and maybe implement it in A64 assembly using vectorization. Unfortunately, I couldn't get far enough to really test it, so I'm moving on to another function.

It also didn't help that this code has been heavily "patched" over the years. There is so much going on just to determine which sorting algorithm to call. It feels like this set of functions fails to follow the "doing one thing" part of the Unix philosophy and just focuses on the "doing it well" part. I say it was "patched" because we can see how all this originated. Also, here is a more candid, informal perspective on the matter.


Thursday, March 30, 2017

Lab 7 - Brief LAME assembler usage

LAME is an MP3 encoder, despite the originally assigned recursive acronym standing for 'LAME Ain't an MP3 Encoder'. It claims to be an educational tool for learning about MP3 encoding, with the goal of improving its quality and performance. Interestingly, LAME is distributed by the developers only as source code, with the stated purpose of being merely an educational tool, which in turn avoids patent infringement. Still, the developers advise acquiring a license before deploying projects that implement any part of compiled LAME code.



MP3

MP3 (aka MPEG-1/MPEG-2 Audio Layer III) is a digital audio format that reduces the size of original audio files by using lossy audio compression algorithms. The format takes advantage of the limits of human sound perception (the psychoacoustic model): the audible frequency range and frequency resolution, i.e. the ability to notice changes in frequency. The compression discards or reduces those inaudible components.

The MPEG-1 standard provides example psychoacoustic models among other specifications, but it doesn't precisely specify the encoder, leaving developers to create their own algorithms for removing undesired information from a sound file. This led to the creation of many MP3 encoders, each producing output of different quality. For example, some encoders were better at dealing with low-bitrate files, while others (e.g. LAME) were better at higher bitrates. Eventually, LAME evolved and became widely used.

Encoding

Encoding MP3 involves a lot of frequency manipulation over many samples, ideally processing those samples as fast as possible. In software, this can be done by implementing algorithms like the fast Fourier transform, which converts a signal from the time domain to the frequency domain, allowing a much more convenient way to analyse and manipulate the data. Once there is a concrete theoretical basis that works on discrete data, the next level of optimization is to exploit the hardware features available to reduce processing time. LAME is able to make use of such hardware features to optimize encoding.

Assembly Usage

LAME uses assembly extensively. It specifically targets the x86 family and can make use of features such as MMX, SSE, SSE2 and '3DNow!'. Although there's very little inline assembly, there are various nasm files defining functions that are called from the C code.

The assembly code is used for checking CPU features, performing fast Fourier transforms using different feature sets (depending on what's available), and other discrete-math operations.

I couldn't compile the source for AArch64.

Conclusion

LAME has been in development for a long time and is widely used on many OSes. Their website says: "LAME compiles on Windows, DOS, GNU/Linux, MacOS X, *BSD, Solaris, HP-UX, Tru64 Unix, AIX, Irix, NeXTstep, SCO Unix, UnixWare, Ultrix, Plan 9, OpenVMS, MacOS Classic, BeOS, QNX, RiscOS, AmigaOS, OS/2, SkyOS, FreeMiNT(Atari) and probably a few more". It really is a project that has been optimized and ported a lot, which benefits the user but immensely increases the complexity of compiling the code. Furthermore, the assembly code limits the program's optimizations to specific platforms, even though other platforms may offer equivalent features.

Lab 7 - Inline A64 Assembler

This lab explores the benefits of using inline assembly, specifically on AArch64. If writing assembly were as easy as writing C (and for some people it is), C would be obsolete. Thankfully, C provides abstraction so we don't have to deal with processor-specific instructions, and it makes code portable between processors. But there are times when it just can't make full use of processor features that could significantly improve the performance of a certain process, at least not without a lot of tinkering to make the compiler understand that it should use those features. The most straightforward way to use those features, while keeping the abstraction of C, is inline assembly, which embeds architecture-specific assembly code within the C source.

To demonstrate inline assembly, I'll optimize the code from lab 5, where 500 million 16-bit sound samples were modified by a volume scaling factor. The final solution I had come up with was the following:


 float vol = 0.333;  
   
 // converting float into uint16_t: 1 whole, 15 decimals  
 uint16_t vol16 = (int)(vol * pow(2,15));  
   
 ///// main process block /////  
 gettimeofday(&t1, NULL);  
   
 for (i = 0; i < INPUT_NUM ; i++)   
   input[i] = ((input[i] * vol16) >> 15);  
   
 gettimeofday(&t2, NULL);  
 ///////////////////////////////  

This solution converts the floating-point volume scaling factor into a fixed-point format and then performs the multiplication. Compiling without any options, processing 500 million samples took very close to 4 seconds (3959 ms); with the -O3 option, the time was cut down to 362 ms.

Implementing inline assembly

I decided to create a more general function to introduce the inline assembly into the code. The code block above turned into this:

 float vol = 0.333;  
   
 ///// main process block /////  
 gettimeofday(&t1, NULL);  
   
 adjustVolume(input, vol, INPUT_NUM);  
   
 gettimeofday(&t2, NULL);  
 ///////////////////////////////  

And the final function is as follows:

void adjustVolume(int16_t* samples, float vsf, unsigned long size){

    unsigned long nli;
    int samples_left;
    int16_t* i;

    //convert float to UQ1.15 fixed point
    uint16_t vol16 = (int)(vsf * pow(2,15));

    //see if there are leftover values
    samples_left = size % 8; //8 samples at a time

    //loop until just before the first non-loadable index
    nli = size - samples_left;
    int16_t* p_nli = &samples[nli];

    for(i = samples; i < p_nli; i += 8){ //8 samples per iteration

        __asm__ __volatile__
            ("DUP v1.8h, %w[volume]\n\t"       //replicate volume into all 8 lanes of v1
             "LD1 {v0.8h}, [%[smp_ptr]]\n\t"   //load 8 16-bit samples into v0
             "SQDMULH v0.8h, v0.8h, v1.8h\n\t" //adjust 8 samples at a time; save in v0
             "ST1 {v0.8h}, [%[smp_ptr]]\n\t"   //store back into the samples array

             : //no outputs; the results go straight to memory

             : [volume] "r" (vol16), [smp_ptr] "r" (i) //input

             : "v0", "v1", "memory" //clobbers: the vectors used and the array memory

            );

    }

    //adjust the remaining (size % 8) samples individually
    while(samples_left){

        samples[size - samples_left] = ((samples[size - samples_left] * vol16) >> 15);

        samples_left--;

    }

}

The function starts by converting the volume scaling factor from float to fixed point, making the value more convenient to load. Getting into the assembly, we pass in the converted 16-bit volume and the array pointer, which advances by eight 16-bit elements each iteration. DUP and LD1 load the volume and the samples into different vectors. SQDMULH multiplies each pair of 16-bit lanes, doubling the 32-bit product and keeping its high half, which overwrites the samples vector. ST1 then stores the adjusted samples back into the array in memory.
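In scalar terms, each SQDMULH lane computes roughly the following (a sketch; the instruction also saturates, which is omitted here). Note how doubling the product and taking the high 16 bits matches the (sample * vol16) >> 15 used in the fixed-point C version:

#include <stdint.h>

/* Scalar equivalent of one SQDMULH lane, ignoring saturation. */
int16_t sqdmulh_lane(int16_t sample, int16_t vol16)
{
    int32_t product = (int32_t)sample * (int32_t)vol16;
    return (int16_t)((2 * product) >> 16);   /* same as (product >> 15) */
}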

To have a better idea of what is going on, here's a visualization using gdb:


Commands:
  • ni: next instruction
  • x/h $[register]: show contents in memory (x) pointed by value in register, format as half-word (h)
  • p $[register]: print value in register. In the case above, only showing half-word format
At the top, we can see that the inline assembly remained intact through compilation. With the commands, we can see a value of the array in memory and then the same value printed as a signed integer in vector register v0. v1 holds the fixed-point volume scaling factor (10911), which is roughly equivalent to 0.333. After executing the multiplication, we see that v0 now holds the adjusted values, and one instruction further, those are stored in memory.

Wrapping the asm block is a loop controlled by an int16_t pointer that skips 8 elements per iteration and exits when there aren't 8 more elements to load. The function checks for leftovers beforehand, to minimize comparisons inside the loop. The worst case is size % 8 = 7, and those last samples are individually adjusted after the main process.

Results!

Finally, this section compares the results of the lab 5 implementation (plain fixed point on 500 million samples) with this lab's implementation (inline assembly on 500 000 007 samples, just to force the worst case). Those extra samples make almost no difference to the result, though.


So, a huge improvement using inline assembly with no compiler options, though no improvement over the optimized compilation. This is probably due to the extra overhead of the function call, and maybe those extra samples made a dent after all; but that was to be expected, since -O3 enables the auto-vectorization option. Still, compiling the inline-assembly version with -O1 already gives results very close to the -O3 option (372 ms), while the older version at -O1 comes in around 760 ms.

Conclusion

Inline assembly lets us make bare-metal optimizations for the parts where performance really matters. For simple operations there's no real need for assembly, so we can enjoy the abstraction C provides for most tasks and hand the heavy lifting to assembly when necessary.

Another benefit is that we can optimize code without submitting all of it to compiler options, which can produce unexpected results (especially with multi-threading).

Wednesday, March 22, 2017

Lab 6 - Auto-Vectorization with gcc

This is an exercise in auto-vectorization. The purpose is to analyse a simple source file's compilation through its disassembly and to modify the source in order to help the compiler produce a more efficient binary.

The source was created by following these guidelines. The following functions live in the main file and are responsible for, respectively, adding pairs of array elements and saving them into a third array, and adding all elements of an array into a single value.


//adds pairs of array elements
void addArr(int* p3, int* p1, int* p2, int length){

    int i;
    
    for(i=0; i<length; i++)
        p3[i] = p1[i] + p2[i];
        
}

//adds elements of an int array
long addElems(int *arr , int length){

    int i;
    long sum = 0;

    for (i=0; i<length; i++)
        sum += arr[i];

    return sum;
}


By compiling the main file on an Aarch64 system with no compiler options, we get the following disassembly for the first function:


0000000000400660 <addArr>:
  400660:       d100c3ff        sub     sp, sp, #0x30 --> Creating Stack
  400664:       f9000fe0        str     x0, [sp,#24] --> Saving p3 pointer
  400668:       f9000be1        str     x1, [sp,#16] --> Saving p1 pointer
  40066c:       f90007e2        str     x2, [sp,#8] --> Saving p2 pointer
  400670:       b90007e3        str     w3, [sp,#4] --> Saving length int
  400674:       b9002fff        str     wzr, [sp,#44] ---> assigning zero to i 
  400678:       14000014        b       4006c8 <addArr+0x68> --->Start of loop
  40067c:       b9802fe0        ldrsw   x0, [sp,#44] ---> load i into r0
  400680:       d37ef400        lsl     x0, x0, #2 --> Quadruples i (int = 4 bytes)
  400684:       f9400fe1        ldr     x1, [sp,#24] --> loads p3 (pointer) into r1
  400688:       8b000020        add     x0, x1, x0 --> offsets p3 pointer by i, save in r0
  40068c:       b9802fe1        ldrsw   x1, [sp,#44] --> load i into r1 (need to keep r0) (again)
  400690:       d37ef421        lsl     x1, x1, #2 --> quadruples i (again)
  400694:       f9400be2        ldr     x2, [sp,#16] --> load p1 (pointer) into r2
  400698:       8b010041        add     x1, x2, x1 --> offsets p1 pointer by i, save in r1
  40069c:       b9400022        ldr     w2, [x1] --> load p1 int value into r2
  4006a0:       b9802fe1        ldrsw   x1, [sp,#44] --> load i into r1 (need to keep r0) (again)
  4006a4:       d37ef421        lsl     x1, x1, #2 --> quadruples i (again)
  4006a8:       f94007e3        ldr     x3, [sp,#8] --> load p2 (pointer) into r3
  4006ac:       8b010061        add     x1, x3, x1 --> offsets p2 pointer by i, save in r1
  4006b0:       b9400021        ldr     w1, [x1] --> load p2 int value into r1
  4006b4:       0b010041        add     w1, w2, w1 --> finally adds pair of elements, save in r1
  4006b8:       b9000001        str     w1, [x0] --> store into p3 array
  4006bc:       b9402fe0        ldr     w0, [sp,#44] \
  4006c0:       11000400        add     w0, w0, #0x1  --> Inc i
  4006c4:       b9002fe0        str     w0, [sp,#44] /
  4006c8:       b9402fe1        ldr     w1, [sp,#44] ---> loads i into r1
  4006cc:       b94007e0        ldr     w0, [sp,#4] ---> loads length int into r0
  4006d0:       6b00003f        cmp     w1, w0  ---> i - length
  4006d4:       54fffd4b        b.lt    40067c <addArr+0x1c> ---> if negative, loop again
  4006d8:       9100c3ff        add     sp, sp, #0x30 ---> restores stack pointer
  4006dc:       d65f03c0        ret   ---> return to caller

As we can see from the code, no vectorization is happening: the function computes a single addition per loop iteration. We also notice that the program uses registers very conservatively. It stores every value to RAM and never even touches register x4, "playing safe" with respect to other calls in the program. Compared to registers, accessing RAM is extremely slow for the CPU.

Now using the '-O1' compiler option:

0000000000400660 <addArr>:
  400660:       6b1f007f        cmp     w3, wzr  --> w3 originally has 'length'
  400664:       5400018d        b.le    400694 <addArr+0x34>
  400668:       51000466        sub     w6, w3, #0x1
  40066c:       910004c6        add     x6, x6, #0x1
  400670:       d37ef4c6        lsl     x6, x6, #2
  400674:       d2800003        mov     x3, #0x0                        // #0
  400678:       b8636825        ldr     w5, [x1,x3]
  40067c:       b8636844        ldr     w4, [x2,x3] --> r1 and r2 have the pointers to p1 and p2
  400680:       0b0400a4        add     w4, w5, w4
  400684:       b8236804        str     w4, [x0,x3] --> r0 has pointer to p3
  400688:       91001063        add     x3, x3, #0x4
  40068c:       eb06007f        cmp     x3, x6
  400690:       54ffff41        b.ne    400678 <addArr+0x18>
  400694:       d65f03c0        ret


We can see that it computes the results much more directly, minimally touching RAM and allocating no stack. Still, the compiler didn't vectorize the code, as auto-vectorization is not enabled at -O1. To enable it, we either use -O3, which bundles various flags, or just add the single '-ftree-vectorize' flag when compiling.
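To turn it on without the rest of -O3, the compile command looks something like this (source file name assumed):

gcc -O1 -ftree-vectorize lab6.c -o lab6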

Here's what we get with just that single flag on top of -O1:


0000000000400660 <addArr>:
  400660: 6b1f007f  cmp w3, wzr
  400664: 540007ad  b.le 400758 <addArr+0xf8>
  400668: 91004004  add x4, x0, #0x10
  40066c: eb04003f  cmp x1, x4
  400670: 1a9f37e6  cset w6, cs
  400674: 91004025  add x5, x1, #0x10
  400678: eb05001f  cmp x0, x5
  40067c: 1a9f37e5  cset w5, cs
  400680: 2a0500c5  orr w5, w6, w5
  400684: eb04005f  cmp x2, x4
  400688: 1a9f37e6  cset w6, cs
  40068c: 91004044  add x4, x2, #0x10
  400690: eb04001f  cmp x0, x4
  400694: 1a9f37e4  cset w4, cs
  400698: 2a0400c4  orr w4, w6, w4
  40069c: 6a0400bf  tst w5, w4
  4006a0: 54000460  b.eq 40072c <addArr+0xcc>
  4006a4: 71000c7f  cmp w3, #0x3
  4006a8: 54000429  b.ls 40072c <addArr+0xcc>
  4006ac: 53027c66  lsr w6, w3, #2
  4006b0: 531e74c7  lsl w7, w6, #2
  4006b4: 34000387  cbz w7, 400724 <addArr+0xc4>
  4006b8: d2800004  mov x4, #0x0                    // #0
  4006bc: 2a0403e5  mov w5, w4
  4006c0: 8b040028  add x8, x1, x4
  4006c4: 4c407901  ld1 {v1.4s}, [x8]
  4006c8: 8b040048  add x8, x2, x4
  4006cc: 4c407900  ld1 {v0.4s}, [x8]
  4006d0: 4ea08420  add v0.4s, v1.4s, v0.4s --> Vectorization!
  4006d4: 8b040008  add x8, x0, x4
  4006d8: 4c007900  st1 {v0.4s}, [x8]
  4006dc: 110004a5  add w5, w5, #0x1
  4006e0: 91004084  add x4, x4, #0x10
  4006e4: 6b0600bf  cmp w5, w6
  4006e8: 54fffec3  b.cc 4006c0 <addArr+0x60>
  4006ec: 1400000a  b 400714 <addArr+0xb4>
  4006f0: 937e7c85  sbfiz x5, x4, #2, #32
  4006f4: b8656827  ldr w7, [x1,x5]
  4006f8: b8656846  ldr w6, [x2,x5]
  4006fc: 0b0600e6  add w6, w7, w6
  400700: b8256806  str w6, [x0,x5]
  400704: 11000484  add w4, w4, #0x1
  400708: 6b04007f  cmp w3, w4
  40070c: 54ffff2c  b.gt 4006f0 <addArr+0x90>
  400710: 14000012  b 400758 <addArr+0xf8>
  400714: 2a0703e4  mov w4, w7
  400718: 6b07007f  cmp w3, w7
  40071c: 54fffea1  b.ne 4006f0 <addArr+0x90>
  400720: 1400000e  b 400758 <addArr+0xf8>
  400724: 52800004  mov w4, #0x0                    // #0
  400728: 17fffff2  b 4006f0 <addArr+0x90>
  40072c: 51000466  sub w6, w3, #0x1
  400730: 910004c6  add x6, x6, #0x1
  400734: d37ef4c6  lsl x6, x6, #2
  400738: d2800003  mov x3, #0x0                    // #0
  40073c: b8636825  ldr w5, [x1,x3]
  400740: b8636844  ldr w4, [x2,x3]
  400744: 0b0400a4  add w4, w5, w4
  400748: b8236804  str w4, [x0,x3]
  40074c: 91001063  add x3, x3, #0x4
  400750: eb06007f  cmp x3, x6
  400754: 54ffff41  b.ne 40073c <addArr+0xdc>
  400758: d65f03c0  ret


We finally get vectorized operations! Although the function first does an awful lot of comparisons, probably to check for pointer overlap and alignment, resulting in this 64-instruction listing.

As allocations on most modern systems are suitably aligned, we can pretty much tell the compiler to assume that the pointers are, in fact, aligned. Following some techniques in this article, our function now looks like this:


void addArr(int *__restrict p3, int *__restrict p1, int *__restrict p2, int length){
    int i;
    int *arr3 = __builtin_assume_aligned(p3, 16);
    int *arr2 = __builtin_assume_aligned(p2, 16);
    int *arr1 = __builtin_assume_aligned(p1, 16);

    for(i=0; i<length; i++)
        arr3[i] = arr1[i] + arr2[i];
}

The '__builtin_assume_aligned' builtin hints to the compiler that the values are at least 16-byte aligned, and the '__restrict' qualifier tells it that the pointers never alias (never point into overlapping regions of memory) at runtime. Together these let the compiler drop the runtime checks, resulting in this shorter code:


0000000000400660 <addArr>:
  400660: 6b1f007f  cmp w3, wzr
  400664: 5400046d  b.le 4006f0 <addArr+0x90>
  400668: 53027c66  lsr w6, w3, #2
  40066c: 531e74c5  lsl w5, w6, #2
  400670: 340003c5  cbz w5, 4006e8 <addArr+0x88>
  400674: 71000c7f  cmp w3, #0x3
  400678: 54000389  b.ls 4006e8 <addArr+0x88>
  40067c: d2800004  mov x4, #0x0                    // #0
  400680: 2a0403e7  mov w7, w4
  400684: 8b040008  add x8, x0, x4
  400688: 8b04002a  add x10, x1, x4
  40068c: 8b040049  add x9, x2, x4
  400690: 4c407941  ld1 {v1.4s}, [x10]
  400694: 4c407920  ld1 {v0.4s}, [x9]
  400698: 4ea08420  add v0.4s, v1.4s, v0.4s    ---> Still vectorizing!
  40069c: 4c007900  st1 {v0.4s}, [x8]
  4006a0: 110004e7  add w7, w7, #0x1
  4006a4: 91004084  add x4, x4, #0x10
  4006a8: 6b0700df  cmp w6, w7
  4006ac: 54fffec8  b.hi 400684 <addArr+0x24>
  4006b0: 1400000a  b 4006d8 <addArr+0x78>
  4006b4: 937e7c85  sbfiz x5, x4, #2, #32
  4006b8: b8656827  ldr w7, [x1,x5]
  4006bc: b8656846  ldr w6, [x2,x5]
  4006c0: 0b0600e6  add w6, w7, w6
  4006c4: b8256806  str w6, [x0,x5]
  4006c8: 11000484  add w4, w4, #0x1
  4006cc: 6b04007f  cmp w3, w4
  4006d0: 54ffff2c  b.gt 4006b4 <addArr+0x54>
  4006d4: 14000007  b 4006f0 <addArr+0x90>
  4006d8: 2a0503e4  mov w4, w5
  4006dc: 6b05007f  cmp w3, w5
  4006e0: 54fffea1  b.ne 4006b4 <addArr+0x54>
  4006e4: 14000003  b 4006f0 <addArr+0x90>
  4006e8: 52800004  mov w4, #0x0                    // #0
  4006ec: 17fffff2  b 4006b4 <addArr+0x54>
  4006f0: d65f03c0  ret

Applying the same techniques to the 'addElems' function:


long addElems(int *__restrict p_arr , int length){

    int i;
    long sum = 0;
    int *arr = __builtin_assume_aligned(p_arr, 16);

    for (i=0; i<length; i++)
        sum += arr[i];

    return sum;
}

And we also get vectorized, relatively short assembly code.

Conclusion

As GCC supports many different processors and architectures, it needs to stay conservative and correct across all of them. As a result, a lot of optimizations don't take place automatically, and much is left for the programmer to figure out.

Vectorization can grant huge performance boosts, and there is a myriad of existing software that doesn't take advantage of it, sometimes opting for multi-threaded solutions instead. With that said, there's great untapped potential to improve this software with vectorization, and that might lead to better auto-vectorization support on the part of compilers and CPU manufacturers.


Saturday, March 4, 2017

Lab 5 - Compiler Optimization and Algorithm Selection - Part 2


Using floating point for the volume scaling factor


In our original example, we used a floating-point value to represent the volume scaling factor, and then multiplied each 16-bit integer sample by it. From a higher-level language such as C, how we get the result seems obvious: simply multiply both values and extract the whole-number part. Let's see how the computer actually perceives it.



        /////  main process block /////
        gettimeofday(&t1, NULL);
  400c51:       be 00 00 00 00          mov    $0x0,%esi
  400c56:       bf 90 20 60 00          mov    $0x602090,%edi
  400c5b:       e8 20 fb ff ff          callq  400780

        for (i = 0; i < INPUT_NUM ; i++) {
  400c60:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
  400c67:       00
  400c68:       eb 4b                   jmp    400cb5

            input[i] *= 0.5;
  400c6a:       48 8b 45 a0             mov    -0x60(%rbp),%rax
  400c6e:       48 8d 14 00             lea    (%rax,%rax,1),%rdx
  400c72:       48 8b 45 d0             mov    -0x30(%rbp),%rax
  400c76:       48 01 d0                add    %rdx,%rax
  400c79:       48 8b 55 a0             mov    -0x60(%rbp),%rdx
  400c7d:       48 8d 0c 12             lea    (%rdx,%rdx,1),%rcx
  400c81:       48 8b 55 d0             mov    -0x30(%rbp),%rdx
  400c85:       48 01 ca                add    %rcx,%rdx
  400c88:       0f b7 12                movzwl (%rdx),%edx
  400c8b:       0f bf d2                movswl %dx,%edx
  400c8e:       66 0f ef c0             pxor   %xmm0,%xmm0
  400c92:       f2 0f 2a c2             cvtsi2sd %edx,%xmm0
  400c96:       f2 0f 10 0d 7a 04 00    movsd  0x47a(%rip),%xmm1        # 401118 <__dso_handle+0x140>
  400c9d:       00
  400c9e:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
  400ca2:       f2 0f 2c d0             cvttsd2si %xmm0,%edx
  400ca6:       66 89 10                mov    %dx,(%rax)

        /////  main process block /////
        gettimeofday(&t1, NULL);

        for (i = 0; i < INPUT_NUM ; i++) {
  400ca9:       48 8b 45 a0             mov    -0x60(%rbp),%rax
  400cad:       48 83 c0 01             add    $0x1,%rax
  400cb1:       48 89 45 a0             mov    %rax,-0x60(%rbp)
  400cb5:       48 8b 45 a0             mov    -0x60(%rbp),%rax
  400cb9:       48 3d ff 64 cd 1d       cmp    $0x1dcd64ff,%rax
  400cbf:       76 a9                   jbe    400c6a

            //input[i] >>= 1;
            input[i] *= 0.5;
        }
        gettimeofday(&t2, NULL);
  400cc1:       be 00 00 00 00          mov    $0x0,%esi
  400cc6:       bf b0 20 60 00          mov    $0x6020b0,%edi
  400ccb:       e8 b0 fa ff ff          callq  400780
        ///////////////////////////////

That's a lot of instructions for one simple loop!

When entering the for loop, the CPU first loads zero into 'i' (%rbp-0x60) and jumps down to check whether it is greater than INPUT_NUM-1. Afterwards, it jumps back up and loads the input sample value into %edx based on the offset i, sign-extending the 16-bit value to 32 bits. It then performs another conversion, to a 64-bit double-precision floating-point value, loading it into the lower half of the 128-bit xmm0 register! A lot of conversions just to be able to multiply a 16-bit value by a float.

With the sample in xmm0, the program then loads the volume scaling factor into xmm1 and multiplies the two registers. It then converts the result in xmm0 to a 32-bit signed integer in %edx, and %dx is moved back into the input array at the address held in %rax.

That's a lot of extra work to multiply two numbers. To avoid losing precision, the compiler prefers going through all these conversions. We could modify the program so the values being multiplied are compatible enough to need no conversion, but the volume scaling factor is supposed to be between 0.000 and 1.000, so how can we represent decimal numbers in a 16-bit integer format?

Fixed-Point Numbers

 

Fixed-point numbers are a way to represent fractional numbers in binary. Here's how it works:


In this one-byte representation, we can see that, by implying a radix point in the middle, we assign 4 bits to the whole-number part and 4 bits to the fractional part.

Naturally, far fewer numbers can be represented this way than with floating point. We can find the resolution of the fractional part by working out the lowest value that can be represented. In the example above, it would be 2^-4 = 1/16 = 0.0625.

Fixed-point numbers are described with the Qm.n format notation, where m is the number of bits for the whole-number part and n is the number of bits for the fractional part. If there is just one number after the Q, it specifies n. Additionally, unsigned formats are written with a U prefixed to the Q. The example above has a Q notation of UQ4.4.
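As a quick worked example, here's a sketch of encoding and decoding a UQ4.4 value in C:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* encode 2.75 as UQ4.4: 2.75 * 2^4 = 44 (binary 0010 1100) */
    uint8_t q = (uint8_t)(2.75f * 16);

    /* decode by dividing by 2^4 */
    float back = q / 16.0f;

    printf("encoded: %u  decoded: %.4f\n", q, back);  /* 44, 2.7500 */
    return 0;
}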



Implementation


Fixed-point number representation is mainly a convention, meaning that the computational implementation depends on context. There are no special calls to make the computer know that some bits of a certain byte actually represent the fractional part.

In our case, we also need to be able to convert a given volume scaling factor between 0.000 and 1.000 to the closest possible binary value. This is done with the float-to-Q conversion, which scales the value by 2^n and rounds: Q = round(x * 2^n).


Our volume scaling factor only needs a whole-number range of 0 to 1 and doesn't really need to be negative, so it will be a UQ1.15 number, with a resolution of 2^-15, about 3.0518 x 10^-5. Once we multiply it by a sample, the result becomes a 32-bit value, so we need to shift it right by n bits.

Here's the changed code:


 float vol = 0.333;
 // converting float into uint16_t: 1 whole, 15 decimals
 uint16_t vol16 = (int)(vol * pow(2,15));

 /////  main process block /////
 gettimeofday(&t1, NULL);

 for (i = 0; i < INPUT_NUM ; i++) {
  input[i] = ((input[i] * vol16) >> 15);

 }
 gettimeofday(&t2, NULL);
 ///////////////////////////////


Results

 

The final test ran the same code for the simple (original) and fixed-point implementations on the x86_64 (xerxes) and AArch64 (betty) architectures. Additionally, I compiled the same sources with the -O3 option. Each of the four versions ran three times on each architecture, and the average time is the final result.


On xerxes, we notice that the fixed-point implementation actually wasn't an improvement for the default compilation, but it was for the optimized one.

As for betty, there was a significant improvement with the fixed-point implementation, for both compilation options.


Conclusion


x86_64 didn't seem to take well to the fixed-point implementation unless it was told not to worry too much about accuracy and other processes. For example, the thread that was tracking the counter and displaying results didn't work at all, because the value at the address of 'i' was no longer being used as the counter by the CPU and constantly held a value of 5, while the assembly code became much longer and jumped around, somehow making it faster.

On betty, the same thread issue happened, but all implementations were an improvement over the original. I suspect that, ARM being very common in embedded devices, where floating-point operations are not always available or convenient, the fixed-point operations ended up making a huge improvement.

Tuesday, February 28, 2017

Lab 5 - Compiler Optimization with Algorithm Selection

Optimizing code for better performance results


This assignment tests different approaches to writing code, comparing their performance. The goal is to develop an algorithm that applies a constant factor (e.g. volume) to a series of input numbers (e.g. audio samples) as fast as possible.

Ideally, to obtain a more accurate result, I would create three sample files with the Sound eXchange tool (SoX), with frequencies of 100Hz, 1000Hz and 10000Hz, a sample rate of 44100Hz, 16-bit signed integer samples and 2 channels. It takes hundreds of millions of samples to get meaningful results, which would mean three files of 2GB each. Since the tests are run on a remotely shared machine, I decided to generate the samples in the program itself, using a fixed random seed so the same set of samples is compared each time.

The SoX command to generate each file would be as follows:

sox -n -r 44100 -b 16 -c 2 sample.wav synth 11337.868 sin 100


The .WAV format in this case is simply raw PCM with header information, which allows the file to be played by other programs. The command generates a wave long enough to be equivalent to 500 million samples.



Simple Volume Control


Volume control just means that each sample gets multiplied by a volume scaling factor. The simplest way to do this is to iterate through the samples and multiply each by the same number, which for this assignment is between 0 and 1.000. The following code is the core of my implementation.
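A minimal sketch of that main process block, reconstructed from the points below (variable names assumed):

float vol = 0.333;

/////  main process block /////
gettimeofday(&t1, NULL);

for (i = 0; i < INPUT_NUM ; i++)
    input[i] *= vol;

gettimeofday(&t2, NULL);
///////////////////////////////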




Before getting into the code, I should mention that the pthread library is used mainly for display purposes: it keeps track of the counter iterating through the samples without interfering with the processing. This assignment focuses on the performance of a single thread.

The important parts to notice are:
  • 500 million samples of type int16_t are being allocated and assigned a number between -32768 and 32767
  • The main process block iterates through all samples, multiplying each by a constant representing the volume scaling factor
  • Time is being evaluated only in the main process block
The architectures to be tested are x86_64(xerxes) and AArch64(betty).

Running the code above on xerxes multiple times, I get results similar to this:


The same code on betty:


We can verify that the set of numbers and the volume scaling factor (0.333) are the same for both architectures. This can be considered the worst-case (baseline) scenario for each platform. The following topics explore the options available to optimize the main block's algorithm.


Loop-Unrolling


One simple method of optimization is to perform loop unrolling on a loop with a guaranteed multiple of iterations. The main process block changes to the following:

for (i = 0; i < INPUT_NUM ; ) {
    input[i++] *= 0.333f;
    input[i++] *= 0.333f;
}

All it does is perform the multiplication twice per iteration, thus evaluating the loop condition only half as many times. On betty, this gave about an 8% decrease in execution time; on xerxes, about 31%.

Tuesday, January 31, 2017

Lab 4 - Compiled C Lab

This assignment experiments with some of the many argument options available when compiling C code with GCC. A simple "Hello World" program is used to test the different options. To analyze the binary executable, the following arguments will be used:

    -g              #enable debugging information
    -O0             #(Letter 'o' followed by zero) do not optimize
    -fno-builtin    #do not use builtin function optimization


The source code is compiled in a Linux x86_64 environment, and the resulting binary is an ELF (Executable and Linkable Format) file.

To analyze the executable in a readable way, we can use the 'objdump' and 'readelf' commands. With 'objdump' we can show the file's header information (-f), display summary information of the file by section (-s), or disassemble the executable (-d). Having included the debug flag when compiling also helps when using the option to show the C source code (--source) along with the assembly instructions (-d implied).

$ objdump --source hello
========================


00000000004003e0 <printf@plt-0x10>:
  4003e0:       ff 35 22 0c 20 00       pushq  0x200c22(%rip)        # 601008 <_GLOBAL_OFFSET_TABLE_+0x8>
  4003e6:       ff 25 24 0c 20 00       jmpq   *0x200c24(%rip)        # 601010 <_GLOBAL_OFFSET_TABLE_+0x10>
  4003ec:       0f 1f 40 00             nopl   0x0(%rax)

00000000004003f0 <printf@plt>:
  4003f0:       ff 25 22 0c 20 00       jmpq   *0x200c22(%rip)        # 601018 <_GLOBAL_OFFSET_TABLE_+0x18>
  4003f6:       68 00 00 00 00          pushq  $0x0
  4003fb:       e9 e0 ff ff ff          jmpq   4003e0 <_init+0x18>
.
.
.
00000000004004f6 <main>:
#include <stdio.h>

int main() {
  4004f6:       55                      push   %rbp
  4004f7:       48 89 e5                mov    %rsp,%rbp
            printf("Hello World!\n");
  4004fa:       bf a0 05 40 00          mov    $0x4005a0,%edi
  4004ff:       b8 00 00 00 00          mov    $0x0,%eax
  400504:       e8 e7 fe ff ff          callq  4003f0 <printf@plt>
  400509:       b8 00 00 00 00          mov    $0x0,%eax
}
  40050e:       5d                      pop    %rbp
  40050f:       c3                      retq



In the main section, we find the start of our C code, although a lot was already done by the compiler before it, as the earlier assembly shows. On the left are the virtual memory addresses where the instructions reside, followed by the bytes (each shown as a pair of hexadecimal digits) encoding the instruction and its arguments. On the right side, 'objdump' conveniently disassembles each series of bytes. The raw bytes can also be obtained using the '-s' argument with 'objdump'.

$ objdump -s hello
===================

Contents of section .rodata:
 400590 01000200 00000000 00000000 00000000  ................
 4005a0 48656c6c 6f20576f 726c6421 0a00      Hello World!..

With this output, we can see the contents of the sections used by the program. The "Hello World!" string sits at address 4005a0, in the read-only data section. Note that the string itself did not appear in the previous dump, only this address, which is moved into the 'edi' register as the first argument to 'printf'. The zero assigned to the 'eax' register afterwards is not a syscall being prepared: for a variadic function like 'printf', the calling convention uses 'al' to pass the number of vector registers used for arguments, which here is zero.

The program then calls the 'printf' function through the PLT (procedure linkage table), which links the executable to the C library on the system at runtime.

Now let's take a look on the header section for the file.

$ readelf -h hello
===================

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x400400
  Start of program headers:          64 (bytes into file)
  Start of section headers:          8752 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         35
  Section header string table index: 32


There is some useful information here, most notably the start of the section headers. Let's save it for later.

Running 'du -h hello' shows that it occupies 11 kilobytes on disk.

'-static' argument

Now including the '-static' flag on the gcc command, the compiler creates a standalone application by copying entire libraries into the executable. This makes for a very portable application that does not depend on the libraries available on the system, but it also immensely increases the file size. Our simple "Hello World!" program now contains the whole standard C library while using only one function from it. The executable is now almost 900 kilobytes, as opposed to the previous 11K.


Using built in function optimizations

The compiler applies some common optimizations by default; '-fno-builtin' tells it not to. Comparing against an executable built without that flag, we get basically the same 'main' section, with one important difference: the use of 'puts' instead of 'printf'. 'printf' has formatting capabilities, but since we pass only a single string with no format specifiers, the compiler substitutes 'puts', which simply echoes it to standard output.


No debugging enabled

The compiler has the option to include debugging information in the executable, so it becomes easier to analyze its assembly code. In the object dump of the original 'hello' program, the C source was echoed along with the instructions; much of that source is included in the executable, therefore making it a larger file. Without the debug option, the file size decreases from 11K to 8.4K, along with a decrease in the number of section headers.

Using arguments on 'printf'

Now we're going to modify the program a little, so we can compare the changes done to it in assembly.

#include <stdio.h>

int main() {
            printf("Hello World!\n %d %d %d %d %d %d %d %d %d %d",
                                   1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
}

When compiled and disassembled, we have the following:

00000000004004f6 <main>:
  4004f6:       55                      push   %rbp
  4004f7:       48 89 e5                mov    %rsp,%rbp
  4004fa:       48 83 ec 08             sub    $0x8,%rsp
  4004fe:       6a 0a                   pushq  $0xa
  400500:       6a 09                   pushq  $0x9
  400502:       6a 08                   pushq  $0x8
  400504:       6a 07                   pushq  $0x7
  400506:       6a 06                   pushq  $0x6
  400508:       41 b9 05 00 00 00       mov    $0x5,%r9d
  40050e:       41 b8 04 00 00 00       mov    $0x4,%r8d
  400514:       b9 03 00 00 00          mov    $0x3,%ecx
  400519:       ba 02 00 00 00          mov    $0x2,%edx
  40051e:       be 01 00 00 00          mov    $0x1,%esi
  400523:       bf d0 05 40 00          mov    $0x4005d0,%edi
  400528:       b8 00 00 00 00          mov    $0x0,%eax
  40052d:       e8 be fe ff ff          callq  4003f0 <printf@plt>
  400532:       48 83 c4 30             add    $0x30,%rsp
  400536:       b8 00 00 00 00          mov    $0x0,%eax
  40053b:       c9                      leaveq
  40053c:       c3                      retq
  40053d:       0f 1f 00                nopl   (%rax)

Here we can see that the 'main' section follows pretty much the same logic. What is very noticeable is the assignment of the digits to registers and the stack. The calling convention passes the first six integer arguments in registers; the format string takes the first slot, leaving five registers for the numbers. Anything beyond that is pushed onto the stack, in the opposite order from the program, so that 'printf' reads them in the right order: after the register values fill the first '%d's, the next value, at the top of the stack, is 6.

After 'printf' executes, the argument values are still on the stack. 'leaveq' takes care of that by setting the stack pointer back to the base pointer (and popping the saved base pointer), so the next values written to the stack overwrite the previous ones.

Output function

By moving 'printf' into a separate function and calling that function from 'main', the 'main' section now calls the address of a new section called 'output', and in it 'printf' is called. With the default options, the compiler optimizes the instructions so that it behaves like our original C code, but there is still a new section for 'output'.
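A minimal sketch of that change (function name assumed):

#include <stdio.h>

void output()
{
    printf("Hello World!\n");
}

int main()
{
    output();
}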

Full Optimization

For full optimization on compilation, the compiler receives the '-O3' option, which enables the most aggressive optimizations, even when they increase compilation time and code size. Compiling our original "Hello World!" program, we get this:
0000000000400400 <main>:
  400400:       48 83 ec 08             sub    $0x8,%rsp
  400404:       bf b0 05 40 00          mov    $0x4005b0,%edi
  400409:       31 c0                   xor    %eax,%eax
  40040b:       e8 e0 ff ff ff          callq  4003f0 <printf@plt>
  400410:       31 c0                   xor    %eax,%eax
  400412:       48 83 c4 08             add    $0x8,%rsp
  400416:       c3                      retq
  400417:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40041e:       00 00

Compared to the original code, it is more efficient, but also harder to read. One simple optimization is setting 'eax' to zero using an exclusive OR with itself, instead of moving a zero value into the register.

Conclusion

The compiler offers many debugging and optimization options, and for today's programs many of those optimizations happen automatically. But as programs get larger and more complex, there is only so much the compiler can optimize on its own. Programmer-driven optimization is still needed, since programmers have the context to identify the common processing patterns of a program that the compiler can't infer.

Sunday, January 29, 2017

Lab 3 - Assembler Lab

This assignment requires using assembly language to display and increment a value in a loop. The code should exist for both the x86_64 and AArch64 architectures. A sample initial loop was provided in GNU Assembler (GAS); compiled and executed, it displays the following output:

Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop

The next objective is to modify the code so that this output is displayed:
Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9

Our group's solution is the following:

The first step was to create the address locations containing the data we want to display (lines 45-54 of our listing). For our solution, we created three values. The 'msg' value contains the string 'Loop: ', and its length is specified by ". - msg", meaning the current address minus the address of msg. The 'digit' value will be used to display the increasing digit; it is declared as a blank space, but is overwritten in the code with a single digit. The 'nl' value simply adds a newline character after each iteration. Each of these values corresponds to an address in memory that will be accessed later.
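Here's a sketch of what those declarations look like in GAS, following the description above:

.data

msg:    .ascii  "Loop: "
len = . - msg                   /* current address minus msg = string length */

digit:  .ascii  " "             /* one blank byte, overwritten with the digit */
len1 = . - digit

nl:     .ascii  "\n"
len2 = . - nl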

Afterwards, we can analyze the section that will print the 'msg' value, but before we get into the code, we need to understand 'syscall' a little better. 'syscall' is what is used to perform system calls while executing (e.g. writing, reading, exiting). On Linux x86_64, the arguments go in the registers rdi, rsi, rdx, r10, r8 and r9, in that order, with the number of the system call to execute held in rax. In this assignment, only the 'sys_write' system call will be used, and its number is 1. The rdi register holds the file descriptor to write to; for standard output on the console it should hold a value of 1. The rsi register holds the address of the data to be used as the source when writing, so it should point to one of the message values we specified earlier. The last argument 'sys_write' needs is the length of the message, which was declared as len, len1 and len2, and goes in rdx. The final system call amounts to something like this:

                sys_write(1, "Loop: ", len)

Of course, assembly isn't so simple as to just throw some arguments into a function; each argument has to be placed in its correct register. Analyzing lines 14 to 19, we start by loading the length of 'msg' into rdx. Then we assign 'msg' to rsi so it can be used as the source data. Next, we set rdi to 1, so we write to stdout on the console, and set rax to 1, so we call 'sys_write'. With all the registers prepared, 'syscall' on line 19 performs the system call and writes "Loop: " once to the console.
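Reconstructed from the description above, the block amounts to this:

        mov     $len,%rdx           # argument 3: number of bytes to write
        mov     $msg,%rsi           # argument 2: address of the buffer
        mov     $1,%rdi             # argument 1: file descriptor 1, stdout
        mov     $1,%rax             # system call number 1 is sys_write
        syscall                     # kernel performs write(1, msg, len)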

The next block of code (lines 21-28) has the logic for the single digit. The writing process is the same as in the previous block. What is interesting about this block is how the digit to be written changes on each iteration of the loop. The first change we notice is on line 24. As we know from the sample loop, there is a loop counter in register r15; it starts at zero and is incremented by one at the end of each iteration. To display the values from 0 to 9, we simply need to place the current value of the loop counter into the 'digit' buffer and convert it to ASCII. The 'movb' instruction moves the lower byte of r15 into the buffer rsi points to, and on the next line we convert it to ASCII by adding 48. We then write it to the console.
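A plausible reconstruction of that block (the exact addressing is an assumption, since the listing is a screenshot):

        mov     $len1,%rdx          # one byte to write
        mov     $digit,%rsi         # rsi points at the digit buffer
        movb    %r15b,(%rsi)        # low byte of the loop counter into the buffer
        addb    $48,(%rsi)          # '0' is ASCII 48, so 0-9 becomes '0'-'9'
        mov     $1,%rdi             # stdout
        mov     $1,%rax             # sys_write
        syscall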

This concludes the first part of the assignment. The next is to display the numbers from 0 to 30, suppressing the leading zero. Here's my solution:

Now there are two digits to be displayed, and if the leading digit is zero, it should be omitted. In order to know when 10 is reached, we divide the counter by ten: the quotient is used as the leading digit and the remainder as the following digit. The 'div' instruction stores the quotient in the rax register and the remainder in the rdx register, but before each division rdx must be zeroed out, as shown on line 21. After dividing the counter by ten, we then store the quotient in r13 and the remainder in r14, as those registers preserve their values.
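In GAS, that division step looks something like this (the scratch register holding the divisor is an assumption):

        mov     %r15,%rax           # dividend: the loop counter
        mov     $10,%r12            # divisor
        xor     %rdx,%rdx           # div divides rdx:rax, so rdx must be zeroed first
        div     %r12                # quotient -> rax, remainder -> rdx
        mov     %rax,%r13           # keep the quotient (leading digit)
        mov     %rdx,%r14           # keep the remainder (trailing digit)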

As 'msg' is already printed, we now need to print the leading character, but only if it is not zero, so we use the 'cmp' instruction to compare the quotient with zero (line 32). Since 'cmp' sets the flags, we can use the 'je' instruction (jump if equal) to jump to 'lsd' (least significant digit) when the leading digit equals zero, skipping the code that writes it.
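The test itself is only two instructions; the writes on either side of the label proceed exactly as with the single digit:

        cmp     $0,%r13             # is the leading digit zero?
        je      lsd                 # if so, skip straight to the least significant digit
        # ... write r13, converted to ASCII, as before ...
lsd:
        # ... write r14, converted to ASCII ...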

After changing 'max' on line 5, to 31, we get the following output:
Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30

Now that we have the x86_64 assembly code figured out, let's analyze the equivalent code for AArch64:

Right on line 8 we can see the first difference: the instructions are quite different on this architecture. We can see that the register x28, the first operand, receives 'start'. One instruction that stands out is the division. On x86_64, 'div' provides the quotient and the remainder in specific registers. On AArch64, 'udiv' takes three operands: a register to store the quotient, a register with the dividend and a register with the divisor. The remainder isn't provided; to get it, there is a separate instruction called 'msub':
            msub x0, x3, x2, x1    ->    x0 = x1 - (x3 * x2)

And the remainder can be calculated:
            remainder = dividend - ( quotient * divisor )
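Putting the two together on AArch64 (the register choices are assumptions):

        mov     x1, x19             // dividend: the loop counter
        mov     x2, #10             // divisor
        udiv    x0, x1, x2          // quotient: x0 = x1 / x2
        msub    x3, x0, x2, x1      // remainder: x3 = x1 - (x0 * x2)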

Another difference is how the data is sent to the 'digit' address. On x86_64, 'movb' was used to store a single byte; on AArch64, 'str' was used to store a value to memory, with the bits above the digit's byte stored as zeros.
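A sketch of that store (w19 as the counter is an assumption; 'strb' is the single-byte variant, shown here because it leaves the neighbouring bytes untouched, while the original used the full 'str' as described above):

        add     w0, w19, #48        // counter's low bits converted to ASCII
        adr     x1, digit           // address of the 'digit' buffer
        strb    w0, [x1]            // store only the low byte of w0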


Debugging


While AArch64 seemed more efficient and has various seemingly useful instructions, I found x86_64 more intuitive, though it requires more manual handling. I also found that it had many more online resources available.

It is very hard to be efficient at assembly debugging without a tool that allows stepping through the code and viewing values at run time (e.g. gdb). I had to rely on careful methodology, working through the steps on paper. A debugging tool would be absolutely essential for larger projects.

Conclusion

Assembly code is as close to the metal as it gets, unless you're willing to program in hex. It is extremely efficient because of its direct communication with the CPU, but even a simple program can take pages of code and be very hard to debug. It can be very powerful and useful in certain places, mostly in embedded and low-level programming. It also requires a lot of repetitive coding, as it is not portable to other architectures, and different solutions may perform better on each of them. Code that was tuned specifically for x86_64 can still gain a lot from optimization targeted at AArch64.


Friday, January 20, 2017

Lab 2

For this assignment, two open source packages with different licenses are to be obtained, built and executed in a Linux environment.

Emacs

The legendary text editor is GPLv3 licensed and its source can be downloaded via the git repository with the following command:

git clone -b master git://git.sv.gnu.org/emacs.git

For the installation, I'll be using the 'betty' AArch64 system. Once downloaded, the first file I opened was the README, which stated that the INSTALL file has the instructions necessary for installation. According to INSTALL, installation starts by running the 'configure' script, which tries to deduce the correct values for various system-dependent variables and features. To generate the 'configure' script, I executed 'autogen.sh'. When running './configure', it reported that I didn't have a package called 'makeinfo' at version 4.13 or higher, but that I could rerun the command with the argument '--without-makeinfo' and go without the info pages for Emacs. For the purpose of this article, I ran it with that argument.

With the configuration done, the next step is to run 'make' to build the software executable into the 'src' directory. Once done (which took about twenty minutes), Emacs can be executed from its directory. To install it on the system, 'make install' should be run, although that requires superuser privileges.

Ncdu

This is a command line program that analyses disk usage (du) and conveniently displays it with the ncurses (nc) library. It's a relatively small package, but it can be very useful on terminal-only systems (e.g. ssh connections). Ncdu is MIT licensed.

Once cloned into betty, I'm ready to start the build. The README stated that, when building directly from the git repository, the build should start by running 'autoreconf -i'. The rest of the installation is really straightforward: running './configure --prefix=/usr' followed by 'make' built the package executable. The build didn't take more than 10 seconds.

Afterthoughts

Building software can take a while, depending on its complexity and size. Emacs is quite a large piece of software compared to other terminal applications (it may even come with graphical support out of the box), and it took quite a while to generate an executable. It is a project that has been in development for decades, and it only tends to get larger as its almost religious users keep contributing to its development.

Ncdu is a good example of a lightweight and useful utility. It is intended to do one thing and do it well. It is unlikely that it will become a large application, especially being command line focused, so its build process won't take more than a few seconds.

Lab 1

The focus of this post is to compare the development processes of two open source projects with different licenses; more specifically, the way each community accepts code into its beloved project.

Atom

Built on Electron, Atom is a fully hackable text editor focused on development and customization. It is a strong alternative to the (in)famous Sublime Text, where customization can sometimes get in the way of productivity and the free version pops up a window once in a while to offer the full version. Atom is open source and free for anyone. Atom and its core packages are MIT licensed and hosted on GitHub.

The MIT license is the shortest and most liberal of the open source licenses. Anyone can modify the software, as long as the project creator is credited, and there are no liabilities regarding the quality of the software. That means contributing is really straightforward: all that is necessary is the effort to improve the project.

The project uses Git for version control. To contribute, changes are submitted through a pull request. The request is then reviewed according to some guidelines and, after thorough review, accepted into the core project. Although the review is detailed, the process is relatively fast. Here is an example of a pull request for a bug fix, which made it into the core code after three days.

Mozilla

The giant Mozilla is the other open source project I've picked for this assignment. It truly is an example of open source development done well, and really, going open source is what saved it (formerly Netscape) from not existing these days.

It uses the MPL license. According to Mozilla "the MPL fills a useful space in the spectrum of free and open source software licenses, sitting between the Apache license, which does not require modifications to be shared, and the GNU family of licenses, which requires modifications to be shared under a much broader set of circumstances than the MPL."

Mozilla is a huge deal in the open source world, so the contribution process is longer and stricter than most. That said, the people at Mozilla want your contribution, and they help by walking you through the process in detail. Patches take more steps to get through, and there is a lot of feedback and support from other developers. This patch took about 10 months to get through from its initial submission.