Tue, 20 May 2008

(There is an HTML version at <http://canonical.org/~kragen/strlen-utf8.html>.)

On IRC, [Aristotle Pagaltzis](http://plasmasturm.org) was pondering
how much performance variable-width encodings such as UTF-8 actually
cost, because it's commonly suggested that fixed-width encodings such
as ISO-8859-1 and UCS-4 are much faster.

He suggested:

> Huh, it just occurs to me that `strlen` is not at all expensive on
> UTF-8-encoded strings.  Not exactly as fast, but if you write it
> in asm, it only takes one extra instruction to count characters in
> UTF-8 vs those in an 8-bit encoding, per character. So, if you
> factor in cache misses, it should make no measurable
> difference. All you lose with a variable-width encoding is direct
> random access to arbitrary indices in the string, which is
> basically a non-use case.

It turned out that he was partly wrong, but mostly right.  And along
the way, we discovered that GCC's standard implementation of `strlen`
was quite pessimal.

I'm using Linux on a 700MHz Pentium III laptop with GCC 4.1.2, using
just the `-O` flag unless otherwise specified.

My First Assembly Version
-------------------------

First I thought about how to write `strlen`, and came up with this:

            .global my_strlen_s
    my_strlen_s:
            push %esi                
            cld
            mov 8(%esp), %esi
            xor %ecx, %ecx
            ## repnz lodsb doesn't work because lodsb doesn't update ZF
    loop:   lodsb
            test %al, %al
            loopnz loop
            mov %ecx, %eax
            not %eax
            pop %esi
            ret

> For those who aren't well-versed in 80386 assembly language:
> 
> - `lodsb` reads a byte from memory at `%esi`, puts it in `%al`, and
>   increments `%esi`;
> - `loopnz` decrements `%ecx` each time through the loop, and jumps
>   back to the label `loop:` if `%ecx` isn't zero and if the zero
>   flag `ZF` isn't set (that's the "nz" part);
> - `test %al, %al` sets the zero flag `ZF` if `%al` is zero (the C
>   string terminator), among other things.
> - after the loop body has run N times, `%ecx` is -N, so `not %eax`
>   converts the negative number -N from `%ecx` into a positive number
>   N-1.
> - you have to push `%esi` because it's a callee-saves register.  (To
>   my surprise.)
> 
> But there's a `scasb` instruction which I should have used instead
> of `lodsb; test`.

The inner loop there is three instructions.  

GCC's Assembly Version
----------------------

Then I looked at how GCC does `strlen`.  It turns out it inlines it,
even without any extra optimization flags.  Here's a sample call,
albeit with optimization:

     80484d5:	bf 00 99 04 08       	mov    $0x8049900,%edi
     80484da:	fc                   	cld    
     80484db:	b9 ff ff ff ff       	mov    $0xffffffff,%ecx
     80484e0:	b0 00                	mov    $0x0,%al
     80484e2:	f2 ae                	repnz scas %es:(%edi),%al
     80484e4:	f7 d1                	not    %ecx
     80484e6:	49                   	dec    %ecx
     80484e7:	89 4c 24 04          	mov    %ecx,0x4(%esp)

There the inner loop is just the `repnz scas`.  I'd forgotten about
`SCAS`.

A C Version
-----------

Here's a reasonable `strlen` in C:

    int my_strlen(char *s) {
      int i = 0;
      while (*s++) i++;
      return i;
    }

This compiles to the following:

    080483c4 <my_strlen>:
     80483c4:	55                   	push   %ebp
     80483c5:	89 e5                	mov    %esp,%ebp
     80483c7:	8b 55 08             	mov    0x8(%ebp),%edx
     80483ca:	b8 00 00 00 00       	mov    $0x0,%eax
     80483cf:	80 3a 00             	cmpb   $0x0,(%edx)
     80483d2:	74 0c                	je     80483e0 <my_strlen+0x1c>
     80483d4:	b8 00 00 00 00       	mov    $0x0,%eax
     80483d9:	40                   	inc    %eax
     80483da:	80 3c 10 00          	cmpb   $0x0,(%eax,%edx,1)
     80483de:	75 f9                	jne    80483d9 <my_strlen+0x15>
     80483e0:	5d                   	pop    %ebp
     80483e1:	c3                   	ret    

So here the inner loop is again three instructions: `inc %eax`; `cmpb
0, (%eax,%edx,1)`; `jne`.  It's been optimized down to `while (s[i])
i++;`.  The loop termination test is duplicated above the top of the
loop for the empty-string case.

My UTF-8 Assembly Version
-------------------------

So then I thought about how to do what Aristotle was suggesting.  In
UTF-8, bytes that start new characters begin either with binary 0 or
binary 11; the second and subsequent bytes of multibyte characters
have binary 10 as their high bits.  So to count the characters, you
just have to count the bytes that don't begin with binary 10.

I tried this:

            .global my_strlen_utf8_s
    my_strlen_utf8_s:
            push %esi
            cld
            mov 8(%esp), %esi
            xor %ecx, %ecx
    loop2:  lodsb
            test $0x80, %al
            jz ascii                # format 0xxx xxxx
            test $0x40, %al
            jz loop2                # format 10xx xxxx: doesn't start new char
    ascii:  test %al, %al
            loopnz loop2
            mov %ecx, %eax
            not %eax
            pop %esi
            ret

So here we jump back to `loop2`, rather than decrementing `%ecx` with
`loopnz`, in the case where the byte starts with 10.  And we can skip
the `test %al, %al` test, since 0000 0000 doesn't start with 10.

The inner loop of this version has 5 instructions, including two taken
conditional branches, in the usual ASCII case, and 7 instructions for
non-ASCII bytes, rather than 3 instructions per byte.  That's only two
extra instructions in the "usual" case, but if every instruction were
one cycle, that would still be a 67% increase in run-time.

Counting instructions and adding up their cycle count isn't a very
accurate way to measure performance in these superscalar days, though.

My UTF-8 C Version
------------------

A C version looks like this:

    int my_strlen_utf8_c(char *s) {
      int i = 0, j = 0;
      while (s[i]) {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
      }
      return j;
    }

GCC compiles it to this:

    080483e2 <my_strlen_utf8_c>:
     80483e2:	55                   	push   %ebp
     80483e3:	89 e5                	mov    %esp,%ebp
     80483e5:	8b 55 08             	mov    0x8(%ebp),%edx
     80483e8:	0f b6 02             	movzbl (%edx),%eax
     80483eb:	b9 00 00 00 00       	mov    $0x0,%ecx
     80483f0:	84 c0                	test   %al,%al
     80483f2:	74 23                	je     8048417 <my_strlen_utf8_c+0x35>
     80483f4:	b9 00 00 00 00       	mov    $0x0,%ecx
     80483f9:	0f be c0             	movsbl %al,%eax
     80483fc:	25 c0 00 00 00       	and    $0xc0,%eax
     8048401:	3d 80 00 00 00       	cmp    $0x80,%eax
     8048406:	0f 95 c0             	setne  %al
     8048409:	0f b6 c0             	movzbl %al,%eax
     804840c:	01 c1                	add    %eax,%ecx
     804840e:	0f b6 42 01          	movzbl 0x1(%edx),%eax
     8048412:	42                   	inc    %edx
     8048413:	84 c0                	test   %al,%al
     8048415:	75 e2                	jne    80483f9 <my_strlen_utf8_c+0x17>
     8048417:	89 c8                	mov    %ecx,%eax
     8048419:	5d                   	pop    %ebp
     804841a:	c3                   	ret    

An inner loop of 10 instructions --- but containing only a single
conditional jump, the `jne` at the bottom.  It uses the `and`; `cmp`;
`setne`; `movzbl` sequence to put either a 0 or a 1 into `%eax`,
depending on whether the byte fetched began with 10, and adds the
result into `%ecx` each time through the loop.

Aristotle's UTF-8 Assembly Version
----------------------------------

So after all this, I chatted with Aristotle some more, and it turned
out he had a much cleverer trick up his sleeve than I had thought ---
or, in fact, than he had thought.  He wrote:

> But wow, my code is *much* faster than any of the other variants.
> Unexpectedly.

Here's his version:

            .global ap_strlen_utf8_s
    ap_strlen_utf8_s:
            push %esi
            cld
            mov 8(%esp), %esi
            xor %ecx, %ecx
    loopa:  dec %ecx
    loopb:  lodsb
            shl $1, %al
            js loopa
            jc loopb
            jnz loopa
            mov %ecx, %eax
            not %eax
            pop %esi
            ret

In this case, the inner loop is 6 instructions, but as few as 3 of
them can execute.  I hadn't realized that you could get the top two
bits of a byte into the carry and sign flags with a single `shl`
instruction like that!  Aristotle explains:

> `Js` catches all bytes of the form x1xxxxxx. `Jc` catches 1xxxxxxx,
> but because `js` came first, that can only have been 10xxxxxx; and
> `jnz` then catches all 00xxxxxx other than all-0.  This runs about
> 3x as fast as your `my_strlen_s` --- most of the time, anyway.

Performance Results
-------------------

So how do these different approaches fare?  I wrote a program that
creates a 32MB string and timed the different functions on it, in
seconds, using wall-clock time.  Here are the results from one run,
sorted with `sort -t: -k1 -k3 -ns`.  The first few lines are various
functions' return values on the given strings.

    "": 0 0 0 0 0 0
    "hello, world": 12 12 12 12 12 12
    "naïve": 6 6 6 5 5 5
    "こんにちは": 15 15 15 5 5 5
    1: all 'a':
    1:                my_strlen(string) =   33554431: 0.227555
    1:         ap_strlen_utf8_s(string) =   33554431: 0.299494
    1:                   strlen(string) =   33554431: 0.314887
    1:         my_strlen_utf8_c(string) =   33554431: 0.380355
    1:              my_strlen_s(string) =   33554431: 0.432079
    1:         my_strlen_utf8_s(string) =   33554431: 0.525443
    2: all '\xe3':
    2:                my_strlen(string) =   33554431: 0.224037
    2:         ap_strlen_utf8_s(string) =   33554431: 0.299537
    2:                   strlen(string) =   33554431: 0.311552
    2:         my_strlen_utf8_c(string) =   33554431: 0.378162
    2:              my_strlen_s(string) =   33554431: 0.436755
    2:         my_strlen_utf8_s(string) =   33554431: 0.589165
    3: all '\x81':
    3:                my_strlen(string) =   33554431: 0.225011
    3:         ap_strlen_utf8_s(string) =          0: 0.313525
    3:                   strlen(string) =   33554431: 0.316182
    3:         my_strlen_utf8_s(string) =          0: 0.322959
    3:         my_strlen_utf8_c(string) =          0: 0.390958
    3:              my_strlen_s(string) =   33554431: 0.432342

The 33554431 and 0 numbers are the return values; this ensures that
GCC doesn't optimize out the `strlen` call.

So, on my CPU, the C version of `strlen` took about 28% less time than
the built-in inlined one for this long string; it only uses two
registers instead of the three used by the built-in inlined one (the
one that uses `repnz scasb`); and they both seem to be about 12 bytes.
I don't know why GCC inlines the worse one. Most likely it used to be
faster than whatever GCC generated at the time and hasn't been
revisited.

It's worth noting that while my C version of `strlen` was always
faster than the built-in version, Aristotle's UTF-8 version was always
in between.

On Aristotle's Core 2 Duo 1.8GHz (also with GCC 4.1.2 and `-O`), the
difference was very much greater.  Here are his results:

    "": 0 0 0 0 0 0
    "hello, world": 12 12 12 12 12 12
    "naïve": 6 6 6 5 5 5
    "こんにちは": 15 15 15 5 5 5
    1: all 'a':
    1:                my_strlen(string) =   33554431: 0.025906
    1:         ap_strlen_utf8_s(string) =   33554431: 0.039629
    1:         my_strlen_utf8_c(string) =   33554431: 0.096041
    1:                   strlen(string) =   33554431: 0.114821
    1:              my_strlen_s(string) =   33554431: 0.116529
    1:         my_strlen_utf8_s(string) =   33554431: 0.132648
    2: all '\xe3':
    2:                my_strlen(string) =   33554431: 0.025912
    2:         ap_strlen_utf8_s(string) =   33554431: 0.039583
    2:         my_strlen_utf8_c(string) =   33554431: 0.095699
    2:                   strlen(string) =   33554431: 0.114452
    2:              my_strlen_s(string) =   33554431: 0.114622
    2:         my_strlen_utf8_s(string) =   33554431: 0.136109
    3: all '\x81':
    3:                my_strlen(string) =   33554431: 0.026112
    3:         my_strlen_utf8_s(string) =          0: 0.039656
    3:         ap_strlen_utf8_s(string) =          0: 0.039661
    3:         my_strlen_utf8_c(string) =          0: 0.096416
    3:              my_strlen_s(string) =   33554431: 0.115327
    3:                   strlen(string) =   33554431: 0.116629

All of this code is online in two files:

* [mystrlen.c](http://pobox.com/~kragen/sw/mystrlen.c)
* [mystrlen.s](http://pobox.com/~kragen/sw/mystrlen.s)

Conclusions
-----------

1. GCC is better at writing x86 assembly than I am.  No surprise
   there.  Even when its inner loop is 10 instructions, it beats my
   three-instruction inner loops for speed.

2. Aristotle is better at writing x86 assembly than GCC is.

3. Aristotle was essentially correct: the penalty for counting UTF-8
   characters, or indexing into or iterating over the characters of a
   UTF-8 string, is very small.

4. However, there is a speed penalty.  Although GCC's built-in
   `strlen` is much slower than Aristotle's function, a
   straightforward byte-counting C `strlen` compiled with optimization
   is faster still.

5. GCC should change to use the straightforward byte-counting C
   `strlen` instead of what it currently inlines.  The version of
   strlen that GCC inlines is worse than the one it compiled from C in
   every way: it's more instructions, more bytes of machine code, four
   times slower, and uses more registers (one of which is a
   callee-saves register!).

6. People probably shouldn't worry about the efficiency of counting
   and iterating over characters in UTF-8 strings, at least not if
   they were using null-terminated C strings before.

(I wrote this 2008-04-19, when Buenos Aires was still under a blanket
of heavy smoke.)

I went out in the smoke tonight, Saturday night, to try to get food
from Chinatown, despite Beatrice's protestations that 20:00 was too
late.

As I slowly walked the few blocks to the route 107 bus stop, three 107
buses passed me.  I waited at the bus stop as two 107 buses passed
going the other way; while I waited, standing in the street, other
would-be passengers accumulated: a bald man with gray hair cuddling
and kissing with his middle-aged girlfriend as they stood in the
street behind me, and two teenagers.

Eventually I gave up on the bus and hailed a passing taxi, which I
took to a Citibank near Chinatown, where I extracted money from my
bank account via an ATM.

The sidewalk cafes in the commercial district were full of people,
despite the smoke blanketing the city; I recognized an acquaintance
waitressing at the the restaurant "1810", where we first tasted
Argentine empanadas.  A few blocks away, as I walked in the direction
of the 107 bus route and Chinatown, I found a long line of mostly old
people.  I asked a young man standing in line what the line was for.
He didn't answer for a moment, and then without meeting my eyes, he
explained that it was for bread.

I walked along what I thought was the 107 bus route, but I arrived in
Chinatown before seeing any more 107 buses.  The store I had hoped to
go to had closed at 20:30; I walked around looking for an open store,
so I could buy peanut butter, ginger root, and packaged ramen.  (Ramen
only costs $2 a package there.)

I passed a couple of young men with small shopping carts full to the
brim of 1.5-liter Quilmes beer bottles, waiting to be let into an
apartment complex; elsewhere I passed one or another sentry waiting at
a door, presumably to let in people who had gone out.

After walking about six blocks through almost all of Chinatown, I
never found an open grocery store, so I went to Todos Contentos and
ordered a couple of dishes to take home to Beatrice.

As I waited, I read some of the sports section of the paper.  It had a
list of the rugby and football games that had been canceled because of
the smoke, although it explained that the air "wasn't toxic", just
irritating and allergenic.  Maybe "tóxico" means something different
in Spanish than in English.

As I carried my order from the restaurant to the 107 bus stop, I
stopped by "Dashi", a sushi restaurant near the Buddha Bar.  The
newspaper blurbs outside the door explained that the chef had spent a
long time in Perú and had studied in California, so I hoped that
perhaps they might have some of the sushi flavors I've been missing
here in Argentina: maguro, uni, natto, unagi, ama-ebi, inari, and so
on.  I, went in to read the menu.  Although it had several pages
listing an impressive number of different kinds of sushi, more careful
reading revealed that they were made from a small number of basic
ingredients that did not include any of the above.  I was a little
disappointed but not surprised.

I walked on.  A couple sitting on some steps asked me what my mask was
for --- I explained it was for the smoke.  Wordlessly the man grinned
and lifted his cigarette to his lips and took a long drag, filling his
lungs with much denser smoke.  I laughed.

I eventually caught the 107 home.  Strangely, when I got on, the bus
was empty.

I wrote this on 2008-04-20, but hadn't gotten around to sending it out
until now.  It represents an approach to hardware architecture that's
basically fallen by the wayside, but it's interesting to think about,
even if nobody's found a way to make it practical yet.

Like everything else posted to kragen-hacks without any notice to the
contrary, this program is in the public domain; I abandon any
copyright in it.

#!/usr/bin/python
"""
Simulate a "dynamic dataflow" machine.

My understanding of this comes from
<http://thesz.mskhug.ru/browser/hiersort/doc/sdd2.pdf?format=raw>,
"Dynamic Dataflow Architecture with Token Sorting" and [some lecture
slides from Slovenia](http://csd.ijs.si/courses/dataflow/index.htm
Dataflow Architectures, Jurij Silc).  (Although I read some of the
background afterwards, including Ellen Spertus's thesis and a little
of the work of the other students under Arvind in the 1980s, and it
seems to have been more or less right.)

The basic idea, as I understand it, is as follows.  Every computation
has two arguments, and produces zero or more "message sends".  The
message sends consist of a tag, a destination port, and a datum.  Memory
consists of a bunch of stored message sends.  At each step, memory is
searched for two stored messages with the same tag.  Their data become
the two arguments to a new computation, which also has access to the
tag, and they are removed from memory.

To the memory, the tag can be just an opaque bitstring, but the
execution unit uses the tag to figure out what code to run in the new
computation.  But maybe only some of the tag bits matter for selecting
the code; the other tag bits are available to the computation, though.

The idea is that you can run a lot of these computation steps in
parallel, because you can fire off a new computation any time both of
its arguments become available.  The idea is *not* that this is some
kind of improvement in expressiveness or clarity of your program code.

Zefirov, Stepanov, and Klimov's paper gives this example matrix
multiply state machine:

    group G {i,j,k} { // indices i,j, and k represent the tag
                       // and are implicit inside group.
        node M (x,y) {
            send (x*y) to S.y;
        }
        node S (x,y) {
            if (k<N)
                 send (x+y) to S.x {i,j,k+1};
            else
                 send (x+y) to (array C receiver) {i,j};
        }
    }

They write:

> To start this program we ought to send zeroes to S.x {1..N,1..N,1},
> A[i][k] to M.x{i,1..N,k} and B[k][j] to M.y{1..N,k,j}. This is a
> task in itself, it requires O(N^3) sends. We will discuss it further
> below.

In the above, the capital letters and the indices in {} are part of
the tag, while the .x or .y is the "destination port".

How should that look in Python?  You could imagine phrasing

    send (x+y) to S.x {i,j,k+1};

as

    S[i, j, k+1].x = x + y

Then you could rewrite the above as

    node M {i,j,k} (x,y): S[i, j, k].y = x * y
    node S {i,j,k} (x,y):
        if k < N: S[i, j, k+1].x = x+y
        else: C[i, j] = x+y

But the "node" lines are a problem.

There's no way in Python to write a new kind of subroutine, nor to
write a normal lambda, nor to implicitly bind a name within a scope.
But you could write this:

    class M(node):
        def run((i, j, k), x, y): S[i, j, k].y = x * y
    class S(node):
        def run((i, j, k), x, y):
            if k < N: S[i, j, k+1].x = x+y
            else: C[i, j] = x+y

Or, with decorator syntax, this:

    @node
    def M((i, j, k), x, y): S[i, j, k].y = x * y
    @node
    def S((i, j, k), x, y):
        if k < N: S[i, j, k+1].x = x+y
        else: C[i, j] = x+y

So how can we make that work?

"""

class Machine:
    def __init__(self):
        self.memory = {}
        self.tasks = []
    def run(self):
        (node, indices), args = self.tasks.pop()
        self.message(("running", node, indices, args))
        node.run(indices, args)
    def retrieve(self, tag):
        "Destructive retrieve by tag."
        rv = self.memory[tag]
        del self.memory[tag]
        return rv
    def send(self, tag, argname, value):
        try: old_argname, old_value = self.retrieve(tag)
        except KeyError: self.memory[tag] = argname, value
        else:
            assert argname != old_argname
            self.postpone(tag, {argname: value, old_argname: old_value})
    def postpone(self, tag, data): self.tasks.append((tag, data))
    def message(self, message): print message

class Node:
    def __init__(self, machine, function, argnames, name = '(anon)'):
        self.machine = machine
        self.function = function
        assert len(argnames) == 2, argnames
        self.argnames = argnames
        self.name = name
    def __repr__(self): return '<node %s>' % self.name
    def __getitem__(self, indices): return NodeReceiver(self, indices)
    def assign(self, indices, argname, value):
        assert argname in self.argnames, (argname, self.argnames)
        self.machine.send((self, indices), argname, value)
    def run(self, indices, args):
        arglist = [indices] + [args[argname] for argname in self.argnames]
        self.function(*arglist)

class NodeReceiver:
    def __init__(self, node, indices):
        self.__dict__['node'], self.__dict__['indices'] = node, indices
    def __setattr__(self, attr, value):
        self.node.assign(self.indices, attr, value)

machine = Machine()

def magically_get_argument_names(function):
    code = function.func_code
    return code.co_varnames[:code.co_argcount]

def node(function):
    "A decorator."
    return Node(machine, function, magically_get_argument_names(function)[1:],
                name = function.func_name)

def test():
    "Regression test."

    @node
    def M((i, j, k), x, y): S[i, j, k].g = x * y

    @node
    def S((i, j, k), f, g):
        if k < N: S[i, j, k+1].f = f+g
        else: C[i, j] = f+g

    C = {}
    N = 4
    S[3, 4, 4].f = 3
    S[3, 4, 4].g = 4
    machine.run()
    assert machine.memory == {}
    assert machine.tasks == []
    assert C == {(3, 4): 7}

    S[1, 2, 4].f = 1
    M[1, 2, 4].x = 3
    M[1, 2, 4].y = 4
    machine.run()
    machine.run()
    assert C[1, 2] == 13

if __name__ == '__main__': test()