Friday, April 7, 2017

SPO600 Project - wmemchr

I decided trying to improve the function wmemchr on the glibc library. The function searches a wide character (c) in a wide-character array (s). It is similar to wcschr, except that it takes the array size (n).


#include <wchar.h>

       wchar_t *wmemchr(const wchar_t *s, wchar_t c, size_t n);

This functions exactly as the memchr function, but for wide characters. Wide character in C has a type of wchar_t, which is a 4-byte value (for glibc).

The existing function is fairly simple:

wchar_t *
wmemchr (const wchar_t *s, wchar_t c, size_t n)
{
  /* For performance reasons unfold the loop four times.  */
  while (n >= 4)
    {
      if (s[0] == c)
    return (wchar_t *) s;
      if (s[1] == c)
    return (wchar_t *) &s[1];
      if (s[2] == c)
    return (wchar_t *) &s[2];
      if (s[3] == c)
    return (wchar_t *) &s[3];
      s += 4;
      n -= 4;
    }

  if (n > 0)
    {
      if (*s == c)
    return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    {
      if (*s == c)
    return (wchar_t *) s;
      ++s;
      --n;
    }
  if (n > 0)
    if (*s == c)
      return (wchar_t *) s;

  return NULL;
}

It first tries to do as few comparisons as possible by doing loop unrolling, checking 4 values each loop. Then, if the wide character wasn't found, it looks for it in the last remaining samples, if any. Otherwise, return NULL, indicating the character wasn't found.

Setting up

To be able to test improvements, I created a test file that generates an array of 1 billion wchar_t characters by allocating it on the heap. The very last character is different from all the previous ones, and that's the one we'll be looking for.

I copied the function to the file, and gave it a different name. Then I performed 2 speed tests to look for the character. The first test uses the shared object in the system and the second uses the function I copied from glibc. I did that to have an idea of what optimization options were being used for that function when the system library was compiled. Here are the tests made on an Aarch64 system (betty).



The table illustrates the time comparison between executions of the same function using different compilation options and the system library. We see that O2 and O3 took basically the same time as the system, so we can assume that the C library was compiled with at least O2.

Furthermore, neither the O3 executable or the system library have the wmemchr function use vectorization. When I do 'objdump -d /lib64/libc.so.6 | less' and see the assembly code for the function, I see no use use vectors, but neither does the O3 executable for my tests. This means I should at least assume that O2 was used, but not necessarily O3.
 
 

Next Steps

Next, I'll try to introduce inline assembly into the code, using vectorization. The idea is to replace the contents of the first while loop, since most of the processing is done there, and figure out a way to process 4 32-bit wide values at once, using SIMD instructions for the NEON 128-bit extension found in the ARM Cortex-A series.

The tricky part is not comparing the 4 values, but actually having the assembler instructions return the one that matches between the 4, without making too many comparisons, which would impact on the performance. The approach I'm thinking of is to quickly figure out if a set of values contains the one I'm looking for. If it doesn't, just get the next set of values to avoid wasting precious time. But if it does, then it can take as many instructions necessary, because the value is already close enough.

No comments:

Post a Comment