Saturday, April 22, 2017

Project - Optimizing wmemchr

Continuing from the previous post, I figured out a way to optimize the wmemchr function in glibc. I'll go through the process of figuring it all out, but unfortunately, the final optimization's results weren't good enough to be worth submitting.

First Steps

The first operation I needed to worry about was getting the values into a vector register to compare with the given character. The 'CMEQ' instruction was pointed out to me. It compares the corresponding elements of two vectors, setting each lane of the destination vector to all 1s if the pair matches, or to all 0s if not. Since SIMD instructions don't set condition flags, per-lane result masks like this serve that purpose instead.
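To make that concrete, here is a made-up example of what CMEQ produces, lane by lane, with 32-bit elements (the values are purely illustrative):

    v1 (character duplicated):  {  5,          5,  5,          5 }
    v0 (data from the array):   {  7,          5,  9,          5 }
    CMEQ v2.4s, v0.4s, v1.4s:   {  0, 0xFFFFFFFF,  0, 0xFFFFFFFF }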

I used this instruction to compare four pairs of 32-bit elements. One vector holds the wide character we are looking for, duplicated across its four lanes, and the other holds four elements from the wide-character array. The results are saved in a third vector.

Handling Vectors

Identifying which element matched was the trickiest part. Loading the values and comparing them took only three instructions, but I couldn't find an efficient way, with the SIMD instructions available, to determine which vector element matched. Instead, I opted to simply detect whether a 4-element set contains a match at all, without knowing which element it is. That way, I don't spend instructions analyzing each element individually and can move straight on to the next set.

The way I did this was with an instruction called UMAXV. It finds the element with the highest value within a single vector: 'U' is for unsigned, and 'V' indicates a vector reduction instruction (one that operates across the lanes of a single vector register). If there is no match, the highest value is zero; if there is one, the highest value is a 32-bit word of all 1s (0xFFFFFFFF). I then pass this maximum value out of the inline assembly as an output and check whether it is non-zero, in which case I go through the elements individually to find which one matched. Here's the final function:



wchar_t *
mynew__wmemchr (const wchar_t *s, wchar_t c, size_t n)
{
  while (n >= 4)
    {
      unsigned int set;

      __asm__ __volatile__
              (//duplicate the character across all four lanes
               "DUP v1.4s, %w[c]\n\t"
               "LD1 {v0.4s}, [%[s]]\n\t" //load values

               //if a lane is all 1s, it matches the character
               "CMEQ v2.4s, v0.4s, v1.4s\n\t"

               //"check" if there were any matches
               "UMAXV S3, v2.4s\n\t"
               //move v3.s[0] to set
               "UMOV %w[set], v3.s[0]\n\t"

               : [set] "=r" (set)                   //output
               : [s] "r" (s), [c] "r" (c)           //input
               : "v0", "v1", "v2", "v3", "memory"); //clobbers

      if (set)
        {
          //found match in set
          if (s[0] == c)
            return (wchar_t *) s;
          if (s[1] == c)
            return (wchar_t *) &s[1];
          if (s[2] == c)
            return (wchar_t *) &s[2];
          if (s[3] == c)
            return (wchar_t *) &s[3];
        }

      s += 4;
      n -= 4;
    }

  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    if (*s == c)
      return (wchar_t *) s;

  return NULL;
}



Note that UMAXV does not accept 'v3.4s' as its first operand: since the reduction produces a single value rather than a vector, the destination has to be written as a scalar register, and 'S3' refers to the low 32 bits of v3.

Comparing this to the original function, I ended up adding code so that it makes fewer element-by-element comparisons, but the end results weren't as good as I expected. That is probably because UMAXV itself performs several comparisons internally to determine the highest value in the vector.

Testing

To test the performance of the functions, I created a test file that compares the execution time of the new function against the original one, and compiled it via a makefile into executables with different optimization levels. The test allocates a wide-character array of 1 billion elements, where the very last character is the one being searched for. The test also measures the time taken by the glibc 2.17 shared system library, which is consistently 920 milliseconds.
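The test file itself isn't reproduced here, but a minimal sketch of the harness looks something like this (the fill values and the clock_gettime-based timing are my assumptions; only the array size, the last-element placement, and the mynew__wmemchr name come from the description above):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <wchar.h>

#define NUM_ELEM 1000000000UL  //one billion wide characters

wchar_t *mynew__wmemchr (const wchar_t *s, wchar_t c, size_t n);

int
main (void)
{
  wchar_t *arr = malloc (NUM_ELEM * sizeof (wchar_t));
  if (arr == NULL)
    return 1;

  wmemset (arr, L'a', NUM_ELEM);
  arr[NUM_ELEM - 1] = L'b';  //the character we'll look for

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  wchar_t *p = mynew__wmemchr (arr, L'b', NUM_ELEM);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  long ms = (t1.tv_sec - t0.tv_sec) * 1000
            + (t1.tv_nsec - t0.tv_nsec) / 1000000;
  if (p != NULL)
    printf ("found at index %ld in %ld ms\n", (long) (p - arr), ms);

  free (arr);
  return 0;
}

These were the results: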



Conclusion

The final results showed a reasonable improvement when compiling with no optimization options and with 'O1', but since the system library is optimized with 'O2' or 'O3', the improvement isn't enough to justify submitting a pull request, especially since the change only targets one platform.

SIMD instructions are great for processing bulk data, but I found that narrowing their results down to a single element can get quite tricky.

Friday, April 7, 2017

SPO600 Project - wmemchr

I decided to try to improve the wmemchr function in the glibc library. The function searches for a wide character (c) in a wide-character array (s). It is similar to wcschr, except that it takes the array size (n).


#include <wchar.h>

       wchar_t *wmemchr(const wchar_t *s, wchar_t c, size_t n);

It works exactly like the memchr function, but for wide characters. A wide character in C has the type wchar_t, which is a 4-byte value (in glibc).
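A quick usage example (the string and search character here are just for illustration):

#include <stdio.h>
#include <wchar.h>

int
main (void)
{
  const wchar_t text[] = L"wide characters";
  wchar_t *p = wmemchr (text, L'c', 15);  //search the 15 elements

  if (p != NULL)
    printf ("found at index %td\n", p - text);  //prints 5

  return 0;
}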

The existing function is fairly simple:

wchar_t *
wmemchr (const wchar_t *s, wchar_t c, size_t n)
{
  /* For performance reasons unfold the loop four times.  */
  while (n >= 4)
    {
      if (s[0] == c)
        return (wchar_t *) s;
      if (s[1] == c)
        return (wchar_t *) &s[1];
      if (s[2] == c)
        return (wchar_t *) &s[2];
      if (s[3] == c)
        return (wchar_t *) &s[3];
      s += 4;
      n -= 4;
    }

  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    if (*s == c)
      return (wchar_t *) s;

  return NULL;
}

It unrolls the loop so that four values are checked per iteration, reducing the loop-condition overhead per element. Then, if the wide character wasn't found, it checks the remaining elements, if any. Otherwise, it returns NULL, indicating the character wasn't found.

Setting up

To be able to test improvements, I created a test file that generates an array of 1 billion wchar_t characters by allocating it on the heap. The very last character is different from all the previous ones, and that's the one we'll be looking for.

I copied the function into the file and gave it a different name. Then I ran two speed tests looking for the character: the first uses the shared object on the system, and the second uses the function I copied from glibc. I did this to get an idea of which optimization options were used for that function when the system library was compiled. Here are the tests, run on an AArch64 system (betty).



The table compares the execution times of the same function built with different compilation options against the system library. O2 and O3 took basically the same time as the system library, so we can assume the C library was compiled with at least O2.

Furthermore, neither the O3 executable nor the system library uses vectorization for wmemchr. When I run 'objdump -d /lib64/libc.so.6 | less' and look at the assembly code for the function, I see no use of vector instructions, and the same is true of the O3 executable from my tests. This means I can only assume that at least O2 was used, but not necessarily O3.
 
 

Next Steps

Next, I'll try to introduce inline assembly into the code, using vectorization. The idea is to replace the contents of the first while loop, since most of the processing is done there, and figure out a way to process four 32-bit values at once, using SIMD instructions from the 128-bit NEON extension found in the ARM Cortex-A series.

The tricky part is not comparing the four values, but having the assembly return which of the four matched without making too many comparisons, which would hurt performance. The approach I'm thinking of is to quickly determine whether a set of values contains the one I'm looking for. If it doesn't, just move on to the next set to avoid wasting precious time. If it does, it can take as many instructions as necessary, because the match is already close at hand.
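Here's a rough sketch of that idea using NEON intrinsics from arm_neon.h rather than inline assembly (the helper name check_four is mine, and the cast relies on wchar_t being 32 bits on this platform):

#include <arm_neon.h>
#include <wchar.h>

//check four wchar_t values at once; only if the set contains
//a match do we examine the individual elements
static const wchar_t *
check_four (const wchar_t *s, wchar_t c)
{
  uint32x4_t data = vld1q_u32 ((const uint32_t *) s);  //load 4 elements
  uint32x4_t dup  = vdupq_n_u32 ((uint32_t) c);        //c in all lanes
  uint32x4_t eq   = vceqq_u32 (data, dup);             //per-lane mask

  if (vmaxvq_u32 (eq) == 0)  //is any lane non-zero?
    return NULL;             //no match in this set, move on

  //a match exists somewhere; now it's worth checking each element
  for (int i = 0; i < 4; i++)
    if (s[i] == c)
      return &s[i];

  return NULL;  //not reached when the mask was non-zero
}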

Tuesday, April 4, 2017

qsort - The useful but hard to master glibc function

 Understanding qsort


For my project I decided to take a shot at optimizing the qsort function in glibc. The function sorts an array of nmemb elements, each of size size. Its interesting feature is that it can be used to sort an array of any type: one of its arguments is a pointer to a comparison function, which the programmer provides. This allows sorting integers, strings, or even custom structs. Here's the prototype:

 #include <stdlib.h>  
   
     void qsort(void *base, size_t nmemb, size_t size,  
          int (*compar)(const void *, const void *));  
   
     void qsort_r(void *base, size_t nmemb, size_t size,  
          int (*compar)(const void *, const void *, void *),  
          void *arg);  


The base argument receives the array, and the comparison function compar receives two void pointers to the elements being compared and returns an integer. It must return a negative number if the first element should come before the second, and a positive number if it should come after. If it returns zero, the elements compare as equal and their relative order in the sorted array is undefined. Here's an example that compares doubles:

 int
 compare_doubles (const void *a, const void *b)
 {
   const double *da = (const double *) a;
   const double *db = (const double *) b;

   /* Yields -1, 0, or 1, avoiding the overflow/rounding problems
      of simply returning the difference.  */
   return (*da > *db) - (*da < *db);
 }

Another example would be using an existing function, like strcmp. Although this function doesn't receive void pointer arguments, it can be wrapped inside another function. This is an example from the qsort man page:


 #include <string.h>  
   
     static int  
     cmpstringp(const void *p1, const void *p2)  
     {  
       /* The actual arguments to this function are "pointers to  
        pointers to char", but strcmp(3) arguments are "pointers  
        to char", hence the following cast plus dereference */  
   
       return strcmp(* (char * const *) p1, * (char * const *) p2);  
     }  
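
Calling qsort with it would then look like this (the array here is just an example):

 char *fruits[] = { "pear", "apple", "orange" };

 qsort (fruits, 3, sizeof (char *), cmpstringp);
 /* fruits is now { "apple", "orange", "pear" } */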

For a given array of doubles called arr, of size NUM_ELEM, the qsort function would be called like so:

   qsort( arr, NUM_ELEM, sizeof(double), compare_doubles);  

I haven't yet mentioned the qsort_r alternative. This is an implementation derived from BSD that makes qsort reentrant-safe. The additional argument serves to pass mutable state to the comparison function (instead of, say, a global variable), and it implies that your comparison function must also be reentrant-safe.
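In glibc, qsort_r is exposed under _GNU_SOURCE. Here's a small example of passing state through the extra argument (the sort-direction flag is just for illustration):

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>

 /* The sort direction travels through qsort_r's extra argument
    instead of a global variable.  */
 static int
 cmp_ints (const void *a, const void *b, void *arg)
 {
   int x = *(const int *) a;
   int y = *(const int *) b;
   int descending = *(int *) arg;
   int r = (x > y) - (x < y);

   return descending ? -r : r;
 }

 int
 main (void)
 {
   int arr[] = { 3, 1, 2 };
   int descending = 1;

   qsort_r (arr, 3, sizeof (int), cmp_ints, &descending);
   printf ("%d %d %d\n", arr[0], arr[1], arr[2]);  /* 3 2 1 */

   return 0;
 }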


Under the hood


As with most things in life, the deeper you go, the uglier it gets. To understand how qsort works, we need to dig into the library. The functions qsort and qsort_r are defined in a file called msort.c (why?) in the standard library.

The function qsort simply calls qsort_r with NULL as the last argument. qsort_r then checks sizes, decides whether a temporary array should be allocated on the stack or the heap, checks whether there is enough memory on the heap, and performs other checks to determine how and where the given array will be processed. For example, if the memory requirements are too high, sorting is done on the array itself instead of on a temporary copy.
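In other words, qsort is essentially just a thin wrapper, roughly like this (simplified sketch; __compar_fn_t and __compar_d_fn_t are glibc's internal typedefs for the two comparison-function shapes):

 void
 qsort (void *b, size_t n, size_t s, __compar_fn_t cmp)
 {
   return qsort_r (b, n, s, (__compar_d_fn_t) cmp, NULL);
 }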

From what I understood, there are two possible functions that can be called to sort the given array, depending on the array's size and the memory available on the machine: _quicksort and msort_with_tmp. Neither of them is documented, and only the first has comments in the source. The comments in qsort_r imply that _quicksort is slower and is only used when a temporary array couldn't be allocated in memory, as it "doesn't need a temporary array". In every other case, msort_with_tmp is used to sort the array indirectly.

Optimiz(ing|able\?)


After going through all that horrible-looking code, I started extracting the pieces I thought necessary to do some benchmarking and profiling. There are so many moving pieces that it became really hard just to set up a development environment to work in. Having had no success, my only other option was to work on the code in place, which has two major disadvantages: I wouldn't be able to profile the different sections of the code, and every change would require a complete glibc build. So aside from getting only vague performance results, each change to the code would take ten minutes or more to test.

Now, even assuming I had none of the previous problems and could have set up a development environment, there is the question of whether optimization is even possible. Sorting is a deeply studied subject in computer science, and even so, the theory isn't as complicated as an actual implementation that supports any data type or structure, determines at runtime which sorting algorithm is best for a given situation, is reentrant-safe and standards-compliant, and performs as fast as possible, all while staying portable across architectures. It is a lot to take on, and some have. From the comments in the source, we can see that much of the code was taken from books and other references whose authors studied the subject extensively, so the code may already be as optimized as it can get.

So why did I choose qsort? My original idea was to profile the "function" and determine which step it spends the most time on, then focus on that part and maybe implement it in A64 assembly using vectorization. Unfortunately, I never got to really test it, so I'm moving on to another function.

It also didn't help that this code has been heavily "patched" over the years. There is so much going on just to determine which sorting algorithm to call. It feels like this set of functions fails to follow the "do one thing" part of the Unix philosophy and focuses only on the "do it well" part. I say "patched" because we can see how all of this originated. Also, here is a more candid, informal perspective on the matter.