Saturday, April 22, 2017

Project - Optimizing wmemchr

Continuing from the previous post, I worked out a way to optimize the wmemchr function in glibc. I'll go through the whole process below, but unfortunately the final optimization didn't perform well enough to be worth submitting.

First Steps

The first thing I needed to worry about was getting values into a vector register so they could be compared with the given character. The 'CMEQ' instruction was pointed out to me. It compares two vectors element by element, setting each lane of the destination vector to all 1s if the corresponding pair of elements matches, or to all 0s if not. Since SIMD instructions don't set condition flags, CMEQ records the outcome of each comparison in the vector itself, one lane per element.

I used this instruction to compare four pairs of 32-bit elements at a time. One vector holds the wide character we are searching for, duplicated across all four lanes, while the other holds four consecutive elements from the wide character array. The result is written into a third vector.
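To make the per-lane behaviour concrete, here is a small stand-alone example. It uses the ACLE NEON intrinsics from arm_neon.h instead of inline assembly, purely as an illustration of what DUP, LD1 and CMEQ do (it is not part of the function below, and needs an AArch64 compiler):

#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

int
main (void)
{
    int32_t data[4] = { 7, 2, 9, 2 };

    int32x4_t needle = vdupq_n_s32 (2);             //DUP: 2 in every lane
    int32x4_t chunk  = vld1q_s32 (data);            //LD1: four array elements
    uint32x4_t eq    = vceqq_s32 (chunk, needle);   //CMEQ: per-lane comparison

    //matching lanes come back as 0xffffffff, the rest as 0:
    //prints "00000000 ffffffff 00000000 ffffffff"
    printf ("%08x %08x %08x %08x\n",
            vgetq_lane_u32 (eq, 0), vgetq_lane_u32 (eq, 1),
            vgetq_lane_u32 (eq, 2), vgetq_lane_u32 (eq, 3));
    return 0;
}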

Handling Vectors

Identifying which element matched was the trickiest part. Loading the values and comparing them only took three instructions, but I couldn't figure out an efficient way to tell which vector lane matched using the SIMD instructions available. Instead, I opted to only determine whether a 4-element set contains a match at all, without knowing which element it is. That way I don't spend instructions analyzing each element individually and can move straight on to the next set.

The way I did this was with an instruction called UMAXV. It finds the element with the highest value in one vector: 'U' is for unsigned, and 'V' indicates that it is a vector reduction instruction (it operates across a single vector register). If there isn't a match, the highest value is zero; if there is one, the highest value is a lane of all 1s. I then pass this maximum value out of the inline assembly as an output and check whether it is non-zero, in which case I go through the four elements individually to find which one matched. Here's the final function:



#include <stddef.h>
#include <wchar.h>

wchar_t *
mynew__wmemchr (const wchar_t *s, wchar_t c, size_t n)
{
    while (n >= 4) {

        unsigned int set;

        __asm__ __volatile__
                ("DUP v1.4s, %w[c]\n\t"           //duplicate c into all four lanes
                "LD1 {v0.4s}, [%[s]]\n\t"         //load four wide characters

                //if a lane is all 1s, it matches the character
                "CMEQ v2.4s, v0.4s, v1.4s\n\t"

                //"check" if there were any matches
                "UMAXV S3, v2.4s\n\t"
                //move v3.4s[0] to set
                "UMOV %w[set], v3.4s[0]\n\t"

                :[set] "=r" (set)                   //output
                :[s] "r" (s), [c] "r" (c)           //input
                :"v0", "v1", "v2", "v3", "memory"); //clobbered vector registers and the memory read by LD1

        if (set) {
            //found a match somewhere in this set of four
            if (s[0] == c)
                return (wchar_t *) s;
            if (s[1] == c)
                return (wchar_t *) &s[1];
            if (s[2] == c)
                return (wchar_t *) &s[2];
            if (s[3] == c)
                return (wchar_t *) &s[3];
        }

        s += 4;
        n -= 4;
    }

    //handle the remaining (at most three) elements one by one
    if (n > 0) {
        if (*s == c)
            return (wchar_t *) s;
        ++s;
        --n;
    }
    if (n > 0) {
        if (*s == c)
            return (wchar_t *) s;
        ++s;
        --n;
    }
    if (n > 0)
        if (*s == c)
            return (wchar_t *) s;

    return NULL;
}



For some reason, UMAXV does not accept 'v3.4s' as its destination operand; the destination has to be written as a scalar SIMD register, and 'S3' refers to essentially the same thing (the lowest 32-bit lane of v3).
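The same DUP/LD1/CMEQ/UMAXV idea can also be written with the ACLE NEON intrinsics from arm_neon.h instead of inline assembly. This is only a sketch of the same idea, not the code I benchmarked, and the helper name any_match_in_four is made up for the example:

#include <arm_neon.h>
#include <stdint.h>
#include <wchar.h>

//returns non-zero if any of the four wide characters at p equals c
static inline unsigned int
any_match_in_four (const wchar_t *p, wchar_t c)
{
    int32x4_t needle = vdupq_n_s32 (c);                  //DUP
    int32x4_t chunk  = vld1q_s32 ((const int32_t *) p);  //LD1
    uint32x4_t eq    = vceqq_s32 (chunk, needle);        //CMEQ
    return vmaxvq_u32 (eq);                              //UMAXV
}

With intrinsics the compiler picks and schedules the vector registers itself, which also avoids having to get the clobber list right by hand.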

As we can see by comparing it to the original function, I ended up adding code so that it performs fewer scalar comparisons in the common case where a set contains no match. Still, the end results weren't as good as I expected, probably because UMAXV itself performs extra comparisons internally to determine the highest value in the vector.
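For reference, the generic C implementation in glibc is essentially a scalar loop unrolled four times, roughly along the following lines (a simplified sketch for comparison, not the verbatim glibc source; the name generic_wmemchr is just to avoid clashing with the real function):

#include <stddef.h>
#include <wchar.h>

wchar_t *
generic_wmemchr (const wchar_t *s, wchar_t c, size_t n)
{
    //check four elements per iteration, one scalar comparison each
    while (n >= 4) {
        if (s[0] == c)
            return (wchar_t *) s;
        if (s[1] == c)
            return (wchar_t *) &s[1];
        if (s[2] == c)
            return (wchar_t *) &s[2];
        if (s[3] == c)
            return (wchar_t *) &s[3];
        s += 4;
        n -= 4;
    }

    //handle the remaining (at most three) elements
    while (n-- > 0) {
        if (*s == c)
            return (wchar_t *) s;
        ++s;
    }

    return NULL;
}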

Testing

To test the performance of the functions, I created a test file that compares the execution time of the new function against the original one, and compiled it via a makefile into executables at different optimization levels (a rough sketch of the timing harness is included after the results). The test allocates a wide character array of 1 billion elements and puts the character being searched for in the very last position, so every call has to scan the whole array. It also measures the time taken by the wmemchr in the glibc 2.17 shared system library, which came in consistently at around 920 milliseconds. These were the results:



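For completeness, here is roughly what the timing side of the test looked like. This is a reconstruction rather than the actual test file: the array size and the match in the last element are as described above, but details such as using clock_gettime and the structure of main are filled in for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <wchar.h>

#define N 1000000000UL  //one billion wide characters

//the function under test; the original or the new version gets linked in
wchar_t *mynew__wmemchr (const wchar_t *s, wchar_t c, size_t n);

int
main (void)
{
    wchar_t *buf = malloc (N * sizeof (wchar_t));
    if (buf == NULL)
        return 1;

    //fill the array and put the needle in the last slot,
    //so the search always walks the entire array
    wmemset (buf, L'a', N);
    buf[N - 1] = L'z';

    struct timespec t0, t1;
    clock_gettime (CLOCK_MONOTONIC, &t0);
    wchar_t *hit = mynew__wmemchr (buf, L'z', N);
    clock_gettime (CLOCK_MONOTONIC, &t1);

    long ms = (t1.tv_sec - t0.tv_sec) * 1000
              + (t1.tv_nsec - t0.tv_nsec) / 1000000;
    printf ("found at index %td in %ld ms\n", hit - buf, ms);

    free (buf);
    return 0;
}
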
Conclusion

The final results showed a reasonable improvement when compiling with no optimization options and with 'O1', but since the system library is built with 'O2' or 'O3', the improvement isn't enough to make it worth submitting upstream, all the more so because it only targets one platform.

SIMD instructions are great for processing bulk data, but I found that narrowing their results back down to a single scalar answer can get quite tricky.
