Saturday, April 22, 2017

Project - Optimizing wmemchr

As a continuation of the previous post, I figured out a way to vectorize the wmemchr function in glibc. I'll go through the process of working it all out, but unfortunately, the final optimization didn't perform well enough to be worth submitting.

First Steps

The first operation I needed to worry about was getting the values into a vector register to compare against the given character. The 'CMEQ' instruction was pointed out to me. It compares two vectors lane by lane, setting each lane of the destination vector to all 1s if the corresponding pair of elements is equal, or to all 0s if not. Since SIMD instructions don't set the condition flags, this per-lane result serves the same purpose as a flag for each vector element.
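
To make the lane-by-lane behaviour concrete, here is a small, self-contained sketch using the NEON intrinsics from arm_neon.h; vceqq_u32 is the intrinsic the compiler lowers to CMEQ, and the values are made up purely for illustration:

#include <arm_neon.h>
#include <stdio.h>

int
main (void)
{
  uint32_t data[4] = { 10, 42, 7, 42 };
  uint32x4_t haystack = vld1q_u32 (data);  /* four 32-bit elements     */
  uint32x4_t needle   = vdupq_n_u32 (42);  /* 42 duplicated to 4 lanes */

  /* Lane-by-lane compare: equal lanes become 0xFFFFFFFF, others 0.  */
  uint32x4_t eq = vceqq_u32 (haystack, needle);

  uint32_t out[4];
  vst1q_u32 (out, eq);
  printf ("%08x %08x %08x %08x\n", out[0], out[1], out[2], out[3]);
  /* prints: 00000000 ffffffff 00000000 ffffffff */
  return 0;
}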

I used this instruction to compare four pairs of 32-bit elements at a time. One vector holds the given wide character duplicated across its four lanes, and the other holds four consecutive elements from the wide-character array. The result is saved in a third vector.

Handling Vectors

Identifying which element matched was the trickiest part. It took three instructions to load the values and compare them, but I couldn't figure out an efficient way to tell which vector element matched using the SIMD instructions available. Instead, I opted to simply determine whether a 4-element set contains a match at all, without knowing which one. That way, I don't spend instructions analyzing each element individually and can move straight on to the next set.

The way I did this was with an instruction called UMAXV. It finds the element with the highest value within a single vector. 'U' stands for unsigned, and 'V' indicates that it is a vector reduction instruction (it operates across one vector register and produces a single scalar). If there isn't a match, that maximum is zero; if there is one, it is a lane of all 1s (0xFFFFFFFF). I pass this maximum out of the inline assembly as an output operand and check whether it is non-zero, in which case I go through the four elements individually to find which one matched. Here's the final function:



wchar_t *
mynew__wmemchr (const wchar_t *s, wchar_t c, size_t n)
{

    while (n >= 4){

        unsigned int set;

        __asm__ __volatile__
                ("DUP v1.4s, %w[c]\n\t"   //duplicate c into all 4 lanes
                "LD1 {v0.4s}, [%[s]]\n\t" //load 4 elements from s

                //if a lane is all 1s, it matches the character
                "CMEQ v2.4s, v0.4s, v1.4s\n\t"

                //"check" if there were any matches
                "UMAXV S3, v2.4s\n\t"
                //move v3.4s[0] to set
                "UMOV %w[set], v3.4s[0]\n\t"



                :[set]"=r"(set)//output
                :[s] "r" (s), [c] "r" (c)//input
                :);

        if (set){
            //found match in set
            if (s[0] == c)
                return (wchar_t *) s;
            if (s[1] == c)
                return (wchar_t *) &s[1];
            if (s[2] == c)
                return (wchar_t *) &s[2];
            if (s[3] == c)
                return (wchar_t *) &s[3];
        }

        s += 4;
        n -= 4;
    }

  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    if (*s == c)
      return (wchar_t *) s;

  return NULL;

}



The UMAXV instruction does not accept 'v3.4s' as its first operand because the reduction produces a single element; the destination has to be written as a scalar SIMD register, and 'S3' is just the 32-bit scalar view of the same register v3.

Comparing it with the original function, I ended up adding code, but in exchange the scalar per-element comparisons only run on a set that is already known to contain a match. Still, the end results weren't as good as I expected, probably because UMAXV itself performs several comparisons internally to determine the highest value in the vector.
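
For reference, the same idea can be expressed without inline assembly using the ACLE NEON intrinsics from arm_neon.h. The sketch below is my own rough equivalent, not the code I benchmarked: vceqq_u32 maps to CMEQ, vmaxvq_u32 (AArch64-only) maps to UMAXV, and the pointer cast assumes a 32-bit wchar_t, as in glibc.

#include <arm_neon.h>
#include <stddef.h>
#include <wchar.h>

static wchar_t *
wmemchr_neon_sketch (const wchar_t *s, wchar_t c, size_t n)
{
  uint32x4_t needle = vdupq_n_u32 ((uint32_t) c);    /* c in all 4 lanes */

  while (n >= 4)
    {
      uint32x4_t chunk = vld1q_u32 ((const uint32_t *) s);
      uint32x4_t eq = vceqq_u32 (chunk, needle);     /* CMEQ */

      /* UMAXV: a non-zero maximum means at least one lane matched.  */
      if (vmaxvq_u32 (eq) != 0)
        {
          if (s[0] == c) return (wchar_t *) s;
          if (s[1] == c) return (wchar_t *) &s[1];
          if (s[2] == c) return (wchar_t *) &s[2];
          return (wchar_t *) &s[3];
        }
      s += 4;
      n -= 4;
    }

  /* Handle the final 0-3 elements one at a time.  */
  for (; n > 0; ++s, --n)
    if (*s == c)
      return (wchar_t *) s;

  return NULL;
}

Whether the compiler schedules this better or worse than the hand-written assembly would need to be measured.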

Testing

To test the performance of the functions, I created a test file that compares the execution time of the new function against the original one, and compiled it via a makefile into executables with different optimization levels. The test allocates a wide-character array of 1 billion elements, and the very last character is the one it will be looking for. The test also measures the time taken using the glibc 2.17 shared system library, which comes in at a consistent 920 milliseconds.
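
The test file itself isn't reproduced here; a minimal sketch of that kind of harness could look like the following, where the fill values and the use of clock_gettime are my own choices rather than what the original test necessarily used:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <wchar.h>

#define N 1000000000UL                 /* one billion wide characters */

/* The modified function, defined elsewhere in the test file.  */
wchar_t *mynew__wmemchr (const wchar_t *s, wchar_t c, size_t n);

static long
elapsed_ms (struct timespec t0, struct timespec t1)
{
  return (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_nsec - t0.tv_nsec) / 1000000;
}

int
main (void)
{
  wchar_t *buf = malloc (N * sizeof (wchar_t));
  if (buf == NULL)
    return 1;

  wmemset (buf, L'a', N);              /* fill with one value...           */
  buf[N - 1] = L'z';                   /* ...and put the target at the end */

  struct timespec t0, t1;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  wchar_t *r1 = wmemchr (buf, L'z', N);          /* system library version */
  clock_gettime (CLOCK_MONOTONIC, &t1);
  printf ("wmemchr:        %ld ms, index %ld\n", elapsed_ms (t0, t1),
          (long) (r1 - buf));

  clock_gettime (CLOCK_MONOTONIC, &t0);
  wchar_t *r2 = mynew__wmemchr (buf, L'z', N);   /* modified version */
  clock_gettime (CLOCK_MONOTONIC, &t1);
  printf ("mynew__wmemchr: %ld ms, index %ld\n", elapsed_ms (t0, t1),
          (long) (r2 - buf));

  free (buf);
  return 0;
}

These were the results: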



Conclusion

The final results showed a reasonable improvement when compiling with no optimization options or with 'O1', but since the system library is built with 'O2' or 'O3', the improvement isn't enough to justify submitting a patch upstream, especially since it only targets one platform.

SIMD instructions are great for processing bulk data, but I found that narrowing a vector result back down to a single lane can get quite tricky.

Friday, April 7, 2017

SPO600 Project - wmemchr

I decided to try to improve the wmemchr function in the glibc library. The function searches for a wide character (c) in a wide-character array (s). It is similar to wcschr, except that it takes the array size (n) instead of stopping at a null terminator.


#include <wchar.h>

       wchar_t *wmemchr(const wchar_t *s, wchar_t c, size_t n);

This works exactly like the memchr function, but for wide characters. A wide character in C has the type wchar_t, which is a 4-byte value in glibc.
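
As a quick illustration of the interface (my own example, not taken from the glibc documentation):

#include <stdio.h>
#include <wchar.h>

int
main (void)
{
  const wchar_t text[] = L"wide characters";

  /* Search the first 15 elements for L'c'; returns a pointer or NULL.  */
  wchar_t *p = wmemchr (text, L'c', 15);

  if (p != NULL)
    printf ("found at index %ld\n", (long) (p - text));   /* index 5 */
  return 0;
}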

The existing function is fairly simple:

wchar_t *
wmemchr (const wchar_t *s, wchar_t c, size_t n)
{
  /* For performance reasons unfold the loop four times.  */
  while (n >= 4)
    {
      if (s[0] == c)
        return (wchar_t *) s;
      if (s[1] == c)
        return (wchar_t *) &s[1];
      if (s[2] == c)
        return (wchar_t *) &s[2];
      if (s[3] == c)
        return (wchar_t *) &s[3];
      s += 4;
      n -= 4;
    }

  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    {
      if (*s == c)
        return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    if (*s == c)
      return (wchar_t *) s;

  return NULL;
}

It first unrolls the loop so that the loop condition only has to be checked once per four elements, comparing four values each iteration. Then, if the wide character hasn't been found, it looks for it in the last remaining elements, if any. Otherwise, it returns NULL, indicating the character wasn't found.

Setting up

To be able to test improvements, I created a test file that allocates an array of 1 billion wchar_t characters on the heap. The very last character is different from all the previous ones, and that's the one we'll be looking for.

I copied the function to the file and gave it a different name. Then I performed two speed tests looking for the character: the first uses the system's shared library and the second uses the function I copied from glibc. I did that to get an idea of what optimization options were used for this function when the system library was compiled. Here are the results on an AArch64 system (betty).



The table compares the execution times of the same function built with different compilation options against the system library. We see that O2 and O3 took basically the same time as the system library, so we can assume that the C library was compiled with at least O2.

Furthermore, neither the O3 executable nor the system library uses vectorization for wmemchr. When I run 'objdump -d /lib64/libc.so.6 | less' and look at the assembly for the function, I see no vector instructions, but neither does the O3 executable from my tests. This means I can only assume that at least O2 was used, not necessarily O3.
 
 

Next Steps

Next, I'll try to introduce inline assembly into the code, using vectorization. The idea is to replace the contents of the first while loop, since most of the processing is done there, and figure out a way to process four 32-bit values at once using SIMD instructions from the 128-bit NEON extension found in the ARM Cortex-A series.

The tricky part is not comparing the four values, but actually having the assembly instructions tell us which of the four matched without making too many comparisons, which would hurt performance. The approach I'm thinking of is to quickly figure out whether a set of values contains the one I'm looking for. If it doesn't, just fetch the next set to avoid wasting precious time. But if it does, then it can take as many instructions as necessary, because the value is already close at hand.