Sat, 31 May 2008

Like everything else posted to kragen-hacks without any notice to the
contrary, this program is in the public domain; I abandon any
copyright in it.

;;; Toggling of argument lists between horizontal and vertical.
;; For example, turn this: memset(bigstring, '\xe3', bigstringsize-1);
;; into this: memset(bigstring,
;;                   '\xe3',
;;                   bigstringsize-1);
;; or vice versa.

;; This was really useful at Airwave back in 2004, but I never
;; understood how it worked.  I miss it, so I reimplemented it, which
;; took me a couple of hours.  The idea is that when you're writing
;; out an argument list becomes too long to write on one line, you
;; have a single key to put each item on its own line; and that same
;; key does the inverse operation, if the argument list is already on
;; multiple lines.

;; Works on things other than argument lists, too, like {}-enclosed
;; blocks of statements, or list and dict displays in Python.

;; Bugs:
;; 
;; - doesn't escape from comments the way it escapes from strings
;; - doesn't drop the trailing separator when doing
;;   vertical-to-horizontal
;; - doesn't add a trailing separator when going
;;   horizontal-to-vertical
;; - removes trailing whitespace going horizontal-to-vertical, even
;;   before a close delimiter, even if there's leading whitespace
;;   after the open delimiter
;; - doesn't understand comments-to-the-end-of-the-line and how they
;;   screw up the transformation
;; - always puts the first argument on the same line as the open
;;   delimiter; it would be better to have a third format in which
;;   that first item indented on the next line instead.
;; - the functions should probably get a package prefix on their names

(defun inside-string-p ()
  "Returns true if we're inside a string."
  (cadddr (syntax-ppss)))

(defun backward-up-list-escaping-strings ()
  "Like backward-up-list, but works if we're inside a string."
  ;; probably should take comments into account too
  (while (inside-string-p) (backward-char))
  (backward-up-list))

(defun start-of-list ()
  "Go to inside the start of the currently enclosing list --- e.g. arg list."
  (interactive)
  (backward-up-list-escaping-strings)
  (down-list))


(defun end-of-list-p ()
  "Can we move no further forward without going up a list?"
  (looking-at "\\(\\s.\\|\n\\|\\s-\\)*\\s)"))

(defun horizontal-to-vertical-list ()
  "Turn a horizontal argument list into a vertical argument list.
This is written so that it only breaks at commas and semicolons; 
too bad for Lisps."
  (interactive)
  (save-excursion
    (start-of-list)
    (while (not (end-of-list-p))
      (while (not (or (end-of-list-p) (looking-at "\\s-*[;,]"))) ; skip over arg
        (forward-sexp))
      (while (looking-at "\\s-*[;,]") (forward-char)) ; skip over comma
      ;; now delete whitespace after comma
      (while (and (not (looking-at "\n")) (looking-at "\\s-")) (delete-char 1))
      (when (not (end-of-list-p)) ; insert newline if needed and indent
        (if (looking-at "\n") (forward-char) (insert "\n"))
        (indent-for-tab-command)))
    (if (current-list-horizontal-p) 
        (message "Couldn't find any commas or semicolons.  Are you editing Lisp?"))))

(defun vertical-to-horizontal-list ()
  (interactive)
  (save-excursion
    (backward-up-list-escaping-strings)
    (forward-list)
    (backward-char)
    (while (not (current-list-horizontal-p))
      (save-excursion (delete-indentation)))))

(defun current-list-horizontal-p ()
  "Returns nil unless the list around point is all on one line."
  (save-excursion
    (backward-up-list-escaping-strings)
    (let ((start (point)))
      (forward-list)
      (= 1 (count-lines start (point))))))


(defun toggle-list-orientation ()
  "Turn a horizontal list into a vertical one, or vice versa."
  (interactive)
  (if (current-list-horizontal-p)
      (horizontal-to-vertical-list)
    (vertical-to-horizontal-list)))

(global-set-key [f7] 'toggle-list-orientation)

Thu, 29 May 2008

(Available in HTML at <http://canonical.org/~kragen/raph-io.html>.)

(This is distinct and unrelated to Steve Dekorte's "Io" programming
language.)

The original paper, which I don't have a copy of, is:

> Raphael Levien, 1989, "Io: a new programming notation", SIGPLAN
> Notices 24(12) December 1989

There is a little material about Io online, including quotes from the
paper.  From
<http://hopl.murdoch.edu.au/showlanguage.prx?exp=4671&language=IO>:

> ## Coroutines ##

> Coroutines are an important concept of computing science, but few
> programming notations properly support them. It is surprising how easy
> they are to implement in Io.

> The idea of coroutines is to have two (or more) routines. When one of
> the routines gets to a point where it can no longer proceed (such as,
> when it needs more input), it is suspended, and another routine
> continues until it, in turn, can no longer continue (such as, when it
> has a value to output). Then, it is suspended and another routine is
> resumed.

> This is used, for example, in creating a stream. A stream carries a
> sequence of numbers, without consuming storage. Therefore, it can be
> infinite. Even in the case of a finite stream, though, it has an
> advantage over a linked list, because computation can begin
> immediately after the first number is known.

> The Io implementation of streams is analogous to linked lists. A
> stream takes two arguments. If there is no more data in the stream, it
> performs its first argument. Otherwise, it performs the second
> argument, with a data value and the continuation of the stream.

> Here we define the operator `count-stream`, and bind an infinite
> counting stream to the variable `s`.

>     count-streamO: ~ x out;
>     out x ~ null out;
>     +xl~x;
>     count-streamO x out.
>     count-stream: -..) ret;
>     ret .-9 null full;
>     count-streamO 0 full.
>     count-stream ~ s

> S has exactly the same structure as a linked list. In fact,
> `writelist s` will write `0 1 2 3 4 5...` on the screen.

There seem to be some OCR errors here.  I think `+x1~x` is supposed to
be `+ x 1 ~ x`, and I suspect (from Raphael Finkel's book, see below)
that `~` is actually supposed to be `->`.  So the definition of
count-stream0 should be as follows:

    count-stream0: -> x out;
            out x -> null out;
            + x 1 -> x;
            count-stream0 x out.

In Scheme:

    (define count-stream0
      (lambda (x out)
        (out x (lambda (null out)
                 (%+ x 1 (lambda (x) (count-stream0 x out)))))))

with the following definition of %+:

    (define (%+ a b cont) (cont (+ a b)))

I'm more mystified about the `count-stream` definition.  From the
text, perhaps the definition is as follows:

    count-stream: -> ret;
            ret -> null full;
            count-stream0 0 full.

Because then `s` gets `-> null full; count-stream0 0 full`, which
takes two arguments (as the text explains) and hands the second one
off to `count-stream0`, which performs it with a data value and the
continuation of the stream.

Raphael Finkel's 1995/1996 book ["Advanced Programming Language
Design"](http://www.nondot.org/sabre/Mirrored/AdvProgLangDesign/),
chapter 2, section 3, contains some more examples.

    write 5; write 6; terminate

which means, in Scheme:

    (write 5 (lambda () (write 6 (lambda () (terminate)))))

Then

    write-twice: -> number; write number; write number; terminate.

which means

    (define write-twice
      (lambda (number) 
        (write number 
               (lambda () (write number (lambda () (terminate)))))))
Then

    write-twice: -> number return;
            write number; write number; return.
    write-twice 7; write 9; terminate

Which means

    (define write-twice
      (lambda (number return)
        (write number (lambda () (write number 
                                        (lambda () (return)))))))
    (write-twice 7 (lambda () (write 9 (lambda () (terminate)))))

Then

    + 2 3 -> number; write number; terminate

which means

    (%+ 2 3 (lambda (number) (write number (lambda () (terminate)))))

Then

    count: -> start end return;
            write start;
            = start end (return);
            + start 1 -> new-start;
            count new-start end return.
    count 1 10; terminate

which means

    (define count 
      (lambda (start end return)
        (write start 
               (lambda ()
                 (%= start end return
                     (lambda () 
                       (%+ start 1 
                           (lambda (new-start)
                             (count new-start end return)))))))))

with the new definition of %=:

    (define (%= a b consequent alternate)
      (if (= a b) (consequent) (alternate)))

This is the CPS expansion of this:

    (define (count start end)
      (write start)
      (if (not (= start end)) (count (+ start 1) end)))

I don't know why there are parentheses in "= start end (return)"
in the Io example.  Perhaps it's an error introduced by Finkel.

One final example, showing the use of parentheses:

    make-pair: -> x y return; 
            user (-> client; client x y); return.

which means

    (define make-pair
      (lambda (x y return)
        (user (lambda (client) (client x y)) (lambda () (return)))))

Here's the definition of writelist mentioned above:

    writelist: -> list return;
            list return -> first rest;
            write first;
            writelist rest;
            return.
    emptylist: -> null notnull; null.
    cons: -> number list econtinuation;
            econtinuation -> null notnull;
            notnull number list.

Usefulness
----------

I wouldn't want to program in Io in the raw way described above; it's
pretty verbose and confusing.  But it's *much* clearer than Scheme for
expressing code in explicit CPS, for three simple reasons.

First, a series of nested lambdas is a flat structure rather than a
nested structure as in Scheme.

Second, the syntactic overhead of the lambda is a single punctuation
character, or possibly three, rather than ten characters including
some letters: `(lambda())`.

Third, as a result, in the usual case, the distance between the names
of arguments and the place they come from (that is, the procedure that
will eventually invoke the lambda that the arguments belong to) is
much less, and they appear as a unit rather than as things far apart.
`+ x 1 -> x;` is quite clear.  (Unfortunately, this closeness of
association is misleading sometimes; consider `out x -> null out;` in
the definition of `count-stream0`, where the `-> null out; ...` part
of the routine is suspended for some arbitrary period of time while
the rest of the program runs, and may in fact never resume.)

More Syntactic Sugar
--------------------

If you actually wanted to write programs in the language, you could
benefit from changing it to have a little bit more syntactic sugar.

### Nested expressions ###

For example, you could define

    count [+ start 1] end return

as an abbreviation for

    + start 1 -> new-start;
    count new-start end return

and for procedures that have only a single exit point, you could
imagine writing

    {-> number; write number; write number}

as an abbreviation for

    -> number return; write number; write number return

In cases where a "statement" contains more than a single set of square
brackets, the order of evaluation could be undefined, so that e.g.

    string-scan src [+ srcidx 1] [- len 1] c

could rewrite either to

    + srcidx 1 -> v1;
    - len 1 -> v2;
    string-scan src v1 v2 c

or to

    - len 1 -> v1;
    + srcidx 1 -> v2;
    string-scan src v2 v1 c

Or the order of evaluation could be defined; who cares?  However, it's
important for our sanity that this:

    string-scan src [+ srcidx 1]; foobar [- len 1]

rewrite to this:

    + srcidx 1 -> v1;
    string-scan src v1;
    - len 1 -> v2;
    foobar v2

and not this:

    + srcidx 1 -> v1;
    - len 1 -> v2;
    string-scan src v1;
    foobar v2

Note that the above transformation is just the CPS transformation in
Scheme for normal nested application expressions.  It's just a
thousand times more readable than usual because of the Io lambda
notation.

### One-argument lambda sugar ###

It might also be helpful to be able to write one-argument lambdas more
concisely, with an automatic name for "the last result".  In Python's
REPL and in Arc, this variable is called "_".  With this, for example,
you could write each of the following:

    count-stream: ; _ -> null full; count-stream0 0 full.

    + 2 3; write _; terminate

    make-pair: -> x y ret; user (; _ x y) ret.

Mostly this is duplicative with the []-nesting idea, though.  I'm not
sure which is better in the cases where both are applicable.
Consider this example:

    def render(text):
        body = str(markdown.Markdown(text))
        soup = BeautifulSoup.BeautifulSoup(body)

        headers = soup('h1')

In Io, that looks like this:

    render: -> text;
        markdown.Markdown text -> foo;
        str foo -> body;
        BeautifulSoup.BeautifulSoup body -> soup;

        soup "h1" -> headers; ...

With implicit single arguments:

    render: ;
        markdown.Markdown _;
        str _;
        BeautifulSoup.BeautifulSoup _;

        _ "h1" -> headers; ...

With nesting:

    render: -> text;
        [BeautifulSoup.BeautifulSoup [str [markdown.Markdown text]]] "h1" 
            -> headers; ...

The nested expressions are more compact, but in this case, I think the
implicit arguments are clearer.

### Conditionals ###

It would be nice if there were a way to conveniently rejoin streams of
control after a conditional.  For example, it would be nice to be able
to write

    if (= x y) (write "x y equal") (write "x y not equal");
    if (= x z) (write "x z equal") (write "x z not equal");
    if (= y z) (write "y z equal") (write "y z not equal");
    whatever

If the language had automatic currying, you could define this `if`
quite easily:

    if: -> cond result alt cont; cond (result cont) (alt cont).

You can use the above `if` definition without automatic currying if
you write out the arguments explicitly:

    if (-> a b; = x y a b) (-> c; write "x y equal" c) 
                           (-> c; write "x y not equal" c)

You could, however, imagine syntactic sugar for this as well.  For
example, this expression could expand into the above call to "if":

    = x y ? write "x y equal" : write "x y not equal"

As with the nested expressions, note that this is just the CPS
transformation for `if`.

Mon, 26 May 2008

(This is available in HTML at <http://canonical.org/~kragen/crypted-disk.html>.)

So I installed Debian on a new disk, then stuck my old disk into an
external USB enclosure to get the files I realized I had forgotten to
copy over.  But it didn't just mount automatically the way I expected
it to, because the disk was encrypted using Debian Etch's automatic
LVM-with-LUKS-and-`dm-crypt`-disk-encryption system, installed with
`partman-crypto`.

How I Figured It Out
--------------------

I ran `dmesg` to see where it was:

    kragen at thrifty:~/tmp$ dmesg
    ...
    usb 1-1: new full speed USB device using uhci_hcd and address 2
    usb 1-1: configuration #1 chosen from 1 choice
    SCSI subsystem initialized
    Initializing USB Mass Storage driver...
    scsi0 : SCSI emulation for USB Mass Storage devices
    usbcore: registered new driver usb-storage
    USB Mass Storage support registered.
    usb-storage: device found at 2
    usb-storage: waiting for device to settle before scanning
      Vendor: IC25N040  Model: ATMR04-0          Rev: 0000
      Type:   Direct-Access                      ANSI SCSI revision: 00
    usb-storage: device scan complete
    SCSI device sda: 78140160 512-byte hdwr sectors (40008 MB)
    sda: Write Protect is off
    sda: Mode Sense: 27 00 00 00
    sda: assuming drive cache: write through
    SCSI device sda: 78140160 512-byte hdwr sectors (40008 MB)
    sda: Write Protect is off
    sda: Mode Sense: 27 00 00 00
    sda: assuming drive cache: write through
     sda: sda1 sda2 < sda5 >
    sd 0:0:0:0: Attached scsi disk sda
    ...

So the device was `/dev/sda`, and I recalled that the encrypted LVM
group had been in the first logical partition.  In the man page for
`cryptsetup` (the version of `cryptsetup` that supports LUKS), I found
the command `luksOpen`, so I tried:

    kragen at thrifty:~/tmp$ sudo cryptsetup luksOpen /dev/sda5 externaldisk

And I succeeded in entering the key and unlocking the first key slot,
so `cryptsetup` created `/dev/mapper/externaldisk`.  Unfortunately the
contents were a logical volume group, not a filesystem by itself.  I
looked at `man lvm` and the like for a while without seeing anything
relevant to opening up an LVM volume group on an existing device.  I
did manage to find `lvscan`:

    kragen at thrifty:~/tmp$ sudo lvscan
      ACTIVE            '/dev/Debian/root' [110.45 GB] inherit
      ACTIVE            '/dev/Debian/swap_1' [1.10 GB] inherit

That's the filesystem I'm currently running off of.  So LVM is indeed
the way to access this stuff (the old setup was set up with the same
install CD) and it is not yet opened.

I wonder how it happens at boot-time?  I think it comes from some
script in some `initrd` file stored in `/boot`.

    kragen at thrifty:~/tmp$ ls /boot
    config-2.6.18-4-686          lost+found
    config-2.6.18-5-686          System.map-2.6.18-4-686
    config-2.6.18-6-686          System.map-2.6.18-5-686
    grub                         System.map-2.6.18-6-686
    initrd.img-2.6.18-4-686      vmlinuz-2.6.18-4-686
    initrd.img-2.6.18-4-686.bak  vmlinuz-2.6.18-5-686
    initrd.img-2.6.18-5-686      vmlinuz-2.6.18-6-686
    initrd.img-2.6.18-6-686
    kragen at thrifty:~/tmp$ file /boot/initrd.img-2.6.18-6-686 
    /boot/initrd.img-2.6.18-6-686: gzip compressed data, from Unix, last 
    modified: Tue Feb 19 02:21:30 2008, max compression
    kragen at thrifty:~/tmp$ gzip -dc /boot/initrd.img-2.6.18-6-686 > initrd.bin
    kragen at thrifty:~/tmp$ file initrd.bin
    initrd.bin: ASCII cpio archive (SVR4 with no CRC)

So I successfully un`gzip`ped the `initrd` --- and it's a `cpio` file?
Being an idiot, I tried extracting it with `cpio -o < initrd.bin`, but
that was unsuccessful.  Still being an idiot, I thought that perhaps
`file` was incorrect in its guess that it was a `cpio` file, and so I
tried mounting it as follows:

    sudo mount -o loop -t cramfs initrd.bin /mnt
    for type in $(ls /lib/modules/2.6.18-6-686/kernel/fs/); do 
        mount -o loop -t "$type" initrd.bin /mnt
    done

and that was unsuccessful too.  I gave up on trying to figure out what
the filesystem was and just `grep`ped it:

    kragen at thrifty:~/tmp$ grep -a lvm initrd.bin |less
    alias block-major-58-* lvm_mod
    alias char-major-109-* lvm_mod
    ...
    PREREQ="mdadm mdrun lvm2"
    if [ -e /scripts/local-top/lvm2 ]; then
	    cryptlvm=""
		    lvm=*)
			    cryptlvm=${x#lvm=}
    ...

Looks like gold!

> (It turns out that what I *should* have done was `cpio -i <
> initrd.bin`; `-o` is for creating `cpio` archives, not extracting
> files from them.  But I should have made a directory to do it in
> first.  The scripts excerpted above are `scripts/local-top/lvm` and
> `scripts/local-top/cryptroot`.)

    kragen at thrifty:~/tmp$ less +/lvm initrd.bin 

Navigating around a bit in there eventually finds me this:

    FSTYPE=''
    eval $(fstype < "$NEWROOT")

    # See if we need to setup lvm on the crypto device
    if [ "$FSTYPE" = "lvm" ] || [ "$FSTYPE" = "lvm2" ]; then
    ...
    if [ -z "$cryptlvm" ]; then
        ...
    elif ! activate_vg "/dev/mapper/$cryptlvm"; then
	    echo "cryptsetup: failed to setup lvm device"
	    return 1
    fi
    kragen at thrifty:~/tmp$ locate fstype
    /usr/lib/klibc/bin/fstype
    kragen at thrifty:~/tmp$ sudo sh -c '/usr/lib/klibc/bin/fstype < /dev/mapper/externaldisk '
    FSTYPE=lvm2
    FSSIZE=0

Great!  I now know that I really do have an LVM volume group to deal
with.  What does `activate_vg` do?

    activate_vg()
    {
	    local vg
	    vg="${1#/dev/mapper/}"
	    ...
	    vgchange -ay ${vg}
	    return $?
    }

Maybe I can do that.

    kragen at thrifty:~/tmp$ sudo vgchange -ay externaldisk
      Volume group "externaldisk" not found
    kragen at thrifty:~/tmp$ sudo vgchange -ay thrifty
      2 logical volume(s) in volume group "thrifty" now active
    kragen at thrifty:~/tmp$ sudo lvscan
      ACTIVE            '/dev/thrifty/root' [35.92 GB] inherit
      ACTIVE            '/dev/thrifty/swap_1' [1.10 GB] inherit
      ACTIVE            '/dev/Debian/root' [110.45 GB] inherit
      ACTIVE            '/dev/Debian/swap_1' [1.10 GB] inherit

Yes!  Apparently `-ay` means `--available y` --- that is, make it
available.  Good thing I remembered that the volume group was called
`thrifty`, the hostname of the system it was built for.  Is there a
way I could have figured that out?

    kragen at thrifty:~/tmp$ ls /sbin/vg*
    /sbin/vgcfgbackup   /sbin/vgcreate   /sbin/vgmerge    /sbin/vgs
    /sbin/vgcfgrestore  /sbin/vgdisplay  /sbin/vgmknodes  /sbin/vgscan
    /sbin/vgchange      /sbin/vgexport   /sbin/vgreduce   /sbin/vgsplit
    /sbin/vgck          /sbin/vgextend   /sbin/vgremove
    /sbin/vgconvert     /sbin/vgimport   /sbin/vgrename

Only `vgdisplay` looks particularly helpful, and it will only display
the volume group *after* we `vgchange -ay` it.  So I don't know.
(`vgscan`?)

    kragen at thrifty:~/tmp$ sudo mount /dev/thrifty/root /mnt
    [in dmesg, not shown on my screen]
    kjournald starting.  Commit interval 5 seconds
    EXT3 FS on dm-4, internal journal
    EXT3-fs: recovery complete.
    EXT3-fs: mounted filesystem with ordered data mode.
    kragen at thrifty:~/tmp$ 

Summary
-------

    kragen at thrifty:~/tmp$ dmesg
    ...
     sda: sda1 sda2 < sda5 >
    ...
    kragen at thrifty:~/tmp$ sudo cryptsetup luksOpen /dev/sda5 externaldisk
    kragen at thrifty:~/tmp$ sudo sh -c '/usr/lib/klibc/bin/fstype < /dev/mapper/externaldisk '
    FSTYPE=lvm2
    ...

(here you magically guess that the volume group name is `thrifty`)

    kragen at thrifty:~/tmp$ sudo vgchange -ay thrifty
      2 logical volume(s) in volume group "thrifty" now active
    kragen at thrifty:~/tmp$ sudo lvscan
      ACTIVE            '/dev/thrifty/root' [35.92 GB] inherit
      ACTIVE            '/dev/thrifty/swap_1' [1.10 GB] inherit
      ACTIVE            '/dev/Debian/root' [110.45 GB] inherit
      ACTIVE            '/dev/Debian/swap_1' [1.10 GB] inherit
    kragen at thrifty:~/tmp$ sudo mount /dev/thrifty/root /mnt

And then when you're done:       

    kragen at thrifty:~$ sudo umount /mnt
    kragen at thrifty:~$ sudo vgchange -an thrifty
      0 logical volume(s) in volume group "thrifty" now active
    kragen at thrifty:~$ sudo cryptsetup luksClose externaldisk


Sat, 24 May 2008

You can clone the repository of this code with git as follows:

    git clone http://canonical.org/~kragen/sw/mailman-milter

This is a very simple [milter](https://www.milter.org/) implementation
in pure Python, including a sample milter.  I'm using this to filter
spam under Postfix on panacea.canonical.org, which is Bad and Wrong
for several reasons:

1. I could do the same thing more easily with sender and recipient
   restrictions.
2. Postfix's "policy service" interface is about a thousand times
   saner than the Milter interface, and performs the same job.
   Writing to the Milter interface does mean that you could, in
   theory, use this with other MTAs that support milters.

But maybe it will be useful for somebody.

About Milters
-------------

Milters are a way to reject or modify mail before it goes into the
mail queue.  I care about this because it lets me bounce spam to its
real sender, not the forged from-address on the mail, so I don't
become part of the backscatter problem.

I'm using the sample milter to bounce unauthorized mail to some
Mailman mailing lists, on a machine with Postfix, so that our server
doesn't generate backscatter.  I'm probably going to switch away from
it shortly because it turns out Postfix has a built-in feature that
does more or less the same thing.

#!/usr/bin/python
"""Very simple pure-Python milter implementation.
Copyright (C) 2008  Kragen Javier Sitaker 2008

> This program is free software: you can redistribute it and/or modify
> it under the terms of the GNU General Public License as published by
> the Free Software Foundation, either version 3 of the License, or
> (at your option) any later version.

> This program is distributed in the hope that it will be useful,
> but WITHOUT ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> GNU General Public License for more details.

> You should have received a copy of the GNU General Public License
> along with this program.  If not, see <http://www.gnu.org/licenses/>.

Sendmail includes a `libmilter` but it has some disadvantages:

- `libmilter` is 2800 lines of C code (according to `sloccount`) with a
  poor security record, and its intended use is handling data that's
  so untrusted that you won't even deliver it as email;
- 2800 lines, which is 4100 physical lines, is a lot bigger and harder
  to audit than the 300+ lines in this program;
- `libmilter` wants to own the main loop of your program and spawn
  threads; neither of these are particularly compatible with Python,
  modern versions of Perl, or many other frameworks; and threads are
  not particularly compatible with good programming practice either,
  especially in C.  (However, this program uses threads in the same
  way as `libmilter`.)
  
I'm implementing this from Todd Vierling's wonderful protocol
documentation, 'The Sendmail Milter Protocol, Version 2', version 1.6.
<http://search.cpan.org/src/AVAR/Sendmail-PMilter-0.96/doc/milter-protocol.txt>

I use it with Postfix 2.3 with this configuration:

    smtpd_milters = inet:localhost:1112
    milter_default_action = tempfail

See <http://www.postfix.org/MILTER_README.html> for details.

My own purposes are fairly simple, so this is a very limited
implementation.

At this point, this program represents about six hours of work.

Problems and Future Work
------------------------

- the program doesn't recognize that `<kragen-tol at lists.canonical.org>`
  is the same as `<kragen-tol at canonical.org>` and just `<kragen-tol>`.
- it doesn't rotate its logs
- it doesn't use syslog
- it's not sure about whether it's called mailman-milter or minimilter
- it stores its configuration file in `/usr/local/minimilter` instead of
  `/etc/minimilter`
- it uses its own `ok` instead of doctest.
- it uses `while 1:` instead of `while True:`.
- uses `input` as a parameter.

Solution to the syslog problem suggested at
<http://www.mechanicalcat.net/richard/log/Python/Simple_usage_of_Python_s_logging_module>:

    import logging, logging.handlers

    logger = logging.getLogger('a_name')
    hdlr = logging.handlers.SysLogHandler(
        facility=logging.handlers.SysLogHandler.LOG_DAEMON)
    formatter = logging.Formatter('%(filename)s: %(levelname)s: %(message)s')
    hdlr.setFormatter(formatter)
    logger.addHandler(hdlr)

This doesn't seem to work though; the logger just discards messages
passed to it.

I am thinking about checking addresses against Mailman lists:

    from Mailman import MailList
    mlist = MailList.MailList(listname, lock=False)
    addrs = mlist.getRegularMemberKeys() + mlist.getDigestMemberKeys()

"""

import struct, sys, thread, socket, cgitb, StringIO, time

def ok(a, b):
    "One-line unit testing function."
    assert a == b, (a, b)

def log(msg):
    print "%s %s" % (time.time(), msg)
    sys.stdout.flush()
def debug(msg): pass
#debug = log

## Basic constants.

class smfir:
    """Namespace for reply codes."""
    addrcpt, delrcpt, accept, replbody, continue_ = '+-abc'
    discard, addheader, chgheader, progress, quarantine = 'dhmpq'
    reject, tempfail, replycode = 'rty'

class smfic:
    """Namespace for command codes."""
    mail, rcpt, optneg, quit, abort = 'MROQA'
    macro = 'D'


## Decoding packet contents: generic data handling.

# `dispatch_message` looks for a decoder it can call with the packet
# data and get back an args tuple to apply the appropriate method to.
# Format objects represent binary data formats; they have a "+" method
# that lets you concatenate them, and you can .encode() or .decode()
# to convert between tuples and binary data.

class TooManyValues(Exception):
    "Signals that you've asked a Format to encode more things than it can."
class Incomplete(Exception):
    "Raised when you try to decode an incomplete data structure."

class Format:
    "Base class for parsing objects."
    def __add__(self, other):
        return Concat(self, other)
    def encode(self, args):
        encoded, extra = self.partial_encode(args)
        if extra: raise TooManyValues
        return encoded

class Remaining(Format):
    """Sucks up remaining data as a string."""
    def width(self, val): return len(val)
    def decode(self, val): return (val,)
    def partial_encode(self, args):
        return args[0], args[1:]
remaining = Remaining()

class AscizMultiple(Remaining):
    "Parses a bunch of null-terminated strings as a string list."
    def decode(self, val):
        return (val.split('\0')[:-1],)
    def partial_encode(self, args): raise "Unimplemented"
asciz_multiple = AscizMultiple()

ok(asciz_multiple.decode("asdf\0fd\0c\0"), (['asdf', 'fd', 'c'],))

class Concat(Format):
    """Parses the concatenation of two data structures."""
    def __init__(self, a, b):
        self.a, self.b = a, b
    def decode(self, val):
        width = self.a.width(val)
        return self.a.decode(val[:width]) + self.b.decode(val[width:])
    def width(self, val):
        awidth = self.a.width(val)
        return awidth + self.b.width(val[awidth:])
    def partial_encode(self, args):
        a_encoded, a_extra = self.a.partial_encode(args)
        b_encoded, b_extra = self.b.partial_encode(a_extra)
        return a_encoded + b_encoded, b_extra

class _uint32(Format):
    def decode(self, val):
        try:
            return struct.unpack('>L', val)
        except struct.error, e:
            raise Incomplete(e)
    def width(self, val): return 4
    def partial_encode(self, args):
        return struct.pack('>L', args[0]), args[1:]
uint32 = _uint32()

ok(uint32.decode('\0\0\0\3'), (3,))
ok((uint32+uint32).decode('\0\0\0\3' '\0\0\0\4'), (3,4))
ok((uint32+uint32+uint32).decode('\0\0\0\3' '\0\0\0\4' '\0\0\0\6'), (3,4,6))
ok((uint32+uint32+uint32).encode((3,4,6)), '\0\0\0\3' '\0\0\0\4' '\0\0\0\6')
ok((uint32 + remaining).decode("\0\0\0\4boo"), (4, "boo"))
ok((uint32 + remaining).encode((4, "boo")), "\0\0\0\4boo")


## Decoding packet contents: milter protocol data formats.

# I follow Vierling's terminology: the globs sent over the socket
# including the leading byte count is a "packet", and the content of
# such a glob (which begins with an opcode byte) is a "message".

smfic_optneg_format = uint32 + uint32 + uint32

class Milter:
    """An abstract base milter."""
    def smfic_optneg(self, version, actions, protocol):
        "Option negotiation."
        return 'O' + smfic_optneg_format.encode((version, 0, 0))

decoders = {
    'smfic_mail': asciz_multiple,
    'smfic_rcpt': asciz_multiple,
    'smfic_optneg': smfic_optneg_format,
}

class Abort(Exception):
    "Raised on SMFIC_ABORT; supposed to reset milter state."
class Quit(Exception):
    "Raised on SMFIC_QUIT; supposed to close connection."

packet_format = uint32 + remaining
def empacketize(val):
    return packet_format.encode((len(val), val))

def _dispatch_message(milter, message):
    # XXX I think the handling of zero-length messages here is okay:
    # The exception propagates up the stack and kills the milter
    # server thread, and hopefully gets logged.  The same thing
    # happens if a decoder below raises Incomplete.
    command_code = message[0]
    debug("message %r, %r" % (command_code, message))

    # XXX move these into the Milter object?
    if command_code == smfic.abort:
        raise Abort # XXX: do the same for SMFIC_BODYEOB?
    if command_code == smfic.quit:
        raise Quit
    if command_code == smfic.macro:
        return []

    map = {smfic.mail: 'smfic_mail',
           smfic.rcpt: 'smfic_rcpt',
           smfic.optneg: 'smfic_optneg'}
    selector = map.get(command_code)
    if selector is None: return smfir.continue_
    args = decoders[selector].decode(message[1:])
    debug("got message %r => %s%s" % (command_code, selector, args))
    return getattr(milter, selector)(*args)

def dispatch_message(milter, message):
    """Parse a message from the MTA and get a response from the milter.

    The message should already have its initial `len` field removed.

    XXX should this move into the Milter class?

    """
    response = _dispatch_message(milter, message)
    if not isinstance(response, list): response = [response]
    return ''.join(map(empacketize, response))

ok(smfic.optneg, 'O')
ok(dispatch_message(Milter(), 'O' '\0\0\0\2' '\0\0\0\x3f' '\0\0\0\x7f'),
   '\0\0\0\x0d' 'O' '\0\0\0\2' '\0\0\0\0' '\0\0\0\0')

def parse_packet(buffer):
    """Given buffer contents, split off a complete packet
    if possible.

    Returns (packetbody, remainingdata) tuple, or raises
    Incomplete.
    """
    # It's a little misleading that we use packet_format here --- the
    # actual packet may end before the end of the buffer.
    length, contents = packet_format.decode(buffer)
    if len(contents) < length: raise Incomplete
    # So we slice it here.
    return (contents[:length], contents[length:])

ok(parse_packet('\0\0\0\4abcde'), ('abcd', 'e'))


## Control flow of milter protocol.

def loop(input, output, milter_factory):
    "Run one or more milters against abstract input and output."
    buf = ""
    milter = milter_factory()
    while 1:
        try:
            message, buf = parse_packet(buf)
        except Incomplete:
            data = input(4096)
            debug("got %r" % data)
            if not data:
                return
            buf += data
            continue

        try:
            answer = dispatch_message(milter, message)
        except Abort:
            milter = milter_factory()
        except Quit:
            return
        else:
            debug("responding with %r" % answer)
            output(answer)

_testresponses = []
_source = StringIO.StringIO(
    # this is a little dodgy because we wouldn't ever really get
    # multiple smfic_optneg packets
    empacketize(smfic.optneg + smfic_optneg_format.encode((2, 0x3f, 0x7f))) +
    empacketize(smfic.optneg + smfic_optneg_format.encode((3, 0x3f, 0x7f))) +
    empacketize(smfic.quit))
loop(_source.read, _testresponses.append, Milter)
ok(_testresponses, [
    empacketize(smfic.optneg + smfic_optneg_format.encode((2, 0, 0))),
    empacketize(smfic.optneg + smfic_optneg_format.encode((3, 0, 0)))])

# tests to make sure unexpected EOF is handled in some way other than
# just spinning.
loop(StringIO.StringIO("").read, "expect no responses", Milter)
loop(StringIO.StringIO("\0\0\0\1").read, "expect no responses", Milter)

# A test with real data from Postfix 2.3.8-2+b1
_realdata = ('\0\0\0\rO\0\0\0\x02\0\0\0=\0\0\0\x7f'
             '\0\0\0VDCj\0watchdog-qemu-image.local\0{daemon_name}\0'
             'watchdog-qemu-image.local\0v\0Postfix 2.3.8\0'
             '\x00\x00\x00\x18Clocalhost\x004\x00\x00127.0.0.1\x00')
_testresponses = []
loop(StringIO.StringIO(_realdata).read, _testresponses.append, Milter)
ok(_testresponses, [
    empacketize(smfic.optneg + smfic_optneg_format.encode((2, 0, 0))),
    '',                # no response for D (macro definition) messages
    empacketize(smfir.continue_)])

# test for commands with default handling (again, real data from Postfix)
_realdata2 = '\x00\x00\x00\x02DH\x00\x00\x00\x1aHthis-is-my-helo-hostname\x00'
_testresponses = []
loop(StringIO.StringIO(_realdata2).read, _testresponses.append, Milter)
ok(_testresponses, ['', empacketize(smfir.continue_)])


def socket_loop(sock, milter_factory):
    "Run one or more milters on an open socket connection."
    loop(sock.recv, sock.send, milter_factory)
    log("connection closed")
    sock.close()

def threaded_server(port, milter_factory):
    "Run a threaded server on localhost."
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sockaddr = ('127.0.0.1', port)
    sock.bind(sockaddr)
    sock.listen(5)

    log("listening on %s" % (sockaddr,))

    while 1:
        (conn, addr) = sock.accept()
        # XXX is there a chance of resource exhaustion here?
        thread.start_new_thread(socket_loop, (conn, milter_factory))
        del conn                        # for GC


## My specific milter.
# Eventually this should go into a file of its own.

class RecipMapMilter(Milter):
    """A simple milter that filters on allowed senders for
    some recipients.

    For recipients not in the map, all senders are allowed.

    For recipients in the map, only specified senders are
    allowed.

    This milter is not case-sensitive.
    
    """
    def __init__(self, recipmap):
        self.recipmap = recipmap
    def normalize_addr(self, addr):
        """Add angle brackets to an address if they're missing.

        XXX having to call this explicitly is bug-prone.

        """
        if not addr.startswith('<'): addr = '<' + addr
        if not addr.endswith('>'): addr = addr + '>'
        return addr.lower()
    def smfic_mail(self, strings):
        "Respond to a MAIL FROM: command."
        self.sender = self.normalize_addr(strings[0])
        log("sender is %s" % self.sender)
        return smfir.continue_
    def smfic_rcpt(self, strings):
        "Respond to an RCPT TO: command."
        recip = self.normalize_addr(strings[0])
        log("recipient is %s" % recip)
        if recip in self.recipmap and self.sender not in self.recipmap[recip]:
            log("rejecting recipient %s" % recip)
            return smfir.reject
        else:
            log("accepting recipient %s" % recip)
            return smfir.continue_

def _testRecipMapMilter():
    # unauthorized case
    milter = RecipMapMilter(
        {'<somebody at somewhere>': ['<privileged at elsewhere>']})
    ok(milter.smfic_mail(['<foo at bar>']), smfir.continue_)
    ok(milter.smfic_rcpt(['<somebody at somewhere>']), smfir.reject)
    ok(milter.smfic_rcpt(['<elsebody at somewhere>']), smfir.continue_)

    # authorized case
    milter = RecipMapMilter(
        {'<somebody at somewhere>': ['<privileged at elsewhere>']})
    ok(milter.smfic_mail(['<privileged at elsewhere>']), smfir.continue_)
    ok(milter.smfic_rcpt(['<somebody at somewhere>']), smfir.continue_)
    ok(milter.smfic_rcpt(['<elsebody at somewhere>']), smfir.continue_)

    # missing angle brackets case.  If a spammer's lousy SMTP
    # implementation fails to supply angle brackets, Postfix
    # 2.3.whatever passes along their lack of angle brackets to the
    # milter.  The milter protocol is lousy through and through.
    milter = RecipMapMilter(
        {'<somebody at somewhere>': ['<privileged at elsewhere>']})
    ok(milter.smfic_mail(['foo at bar']), smfir.continue_)
    ok(milter.smfic_rcpt(['somebody at somewhere']), smfir.reject)
    ok(milter.smfic_rcpt(['elsebody at somewhere']), smfir.continue_)
    milter = RecipMapMilter(
        {'<somebody at somewhere>': ['<privileged at elsewhere>']})
    ok(milter.smfic_mail(['privileged at elsewhere']), smfir.continue_)
    ok(milter.smfic_rcpt(['somebody at somewhere']), smfir.continue_)
    ok(milter.smfic_rcpt(['elsebody at somewhere']), smfir.continue_)

    # Second angle bracket missing case.
    milter = RecipMapMilter(
        {'<somebody at somewhere>': ['<privileged at elsewhere>']})
    ok(milter.smfic_mail(['foo at bar']), smfir.continue_)
    ok(milter.smfic_rcpt(['somebody at somewhere>']), smfir.reject)

    # case-smashing case
    milter = RecipMapMilter(
        {'<somebody at somewhere>': ['<privileged at elsewhere>']})
    ok(milter.smfic_mail(['<foo at bar>']), smfir.continue_)
    ok(milter.smfic_rcpt(['<someBody at soMewhere>']), smfir.reject)
    ok(milter.smfic_rcpt(['<elseBody at soMewhere>']), smfir.continue_)

def discard_stdout(thunk):
    "Temporarily redirect stdout to a bit bucket for testing."
    old_stdout = sys.stdout
    sys.stdout = StringIO.StringIO()
    try: thunk()
    finally: sys.stdout = old_stdout

discard_stdout(_testRecipMapMilter)

if __name__ == '__main__':
    try:
        _, recipmapname, port, logfile = sys.argv
    except:
        sys.stderr.write("usage: %s <mapfile> <portnum> <logfile>\n"
                         % (sys.argv[0]))
    else:
        sys.stdout = file(logfile, 'a')
        cgitb.enable(format='text')

        recipmap = eval(file(recipmapname).read())
        threaded_server(int(port), lambda: RecipMapMilter(recipmap))

Thu, 22 May 2008

So I've been playing with QEMU, which lets you run a virtual computer
inside your normal computer.  At the moment I'm using it to create a
reproducible development environment on a project I'm working on.

Among QEMU's features is the ability to save a virtual machine
snapshot, which includes the entire state of the virtual computer:
memory, CPU, even disk.  This seems similar to KeyKOS's checkpointing
facility, although it seems to be a bit slower, maybe to the point of
being less useful.  (It seems to do all of its I/O before continuing
to run, rather than doing some kind of copy-on-write.  It seems like
fork() might be sufficient to get good copy-on-write performance.)

(In one case, a VM snapshot of a 128MB VM took 53MB of space.)

But suppose you ran your normal GUI session inside of QEMU.  Maybe
every few minutes, you could do a backup of your live session to a
server somewhere nearby.

Benefits
--------

If you do this, you can transport your GUI session from one machine to
another --- something the term "VNC", an abbreviation for "Virtual
Network Computer", promised but never delivered.  If your machine ever
crashes or gets stolen, you can restore from the previous checkpoint;
sometimes this might be worth doing even if only a single application
within it has crashed.  And you can have a large number of GUI
sessions for different users in suspended animation on your disk.

This kind of thing could give users in, for example, an internet cafe,
the freedom to really customize their environment.  Rather than
storing all of their state on a web site, they could store much of it
on the servers at the internet cafe itself, as if they owned it.  They
could install software, keep their files, and so on; and whenever they
came into the cafe, their session would be waiting for them, just as
it was when they left it.

Problems
--------

There are a lot of times these days where you'll want to run
stuff that doesn't run very well inside QEMU: MPlayer, Art of
Illusion, anything with SSE, MMX, or 3-D acceleration.  I think that's
kind of a minor problem, since the data involved in that part of the
system (the latest frame of a movie, say) is usually quite transient
and easy to recreate.

Also, it's not uncommon these days for a GUI session to fill up
gigabytes of RAM, and all of that RAM could in theory change its
contents about once a second.  So you still might end up copying all
or the vast majority of the memory pages during a snapshot.

Tue, 20 May 2008

(There is an HTML version at <http://canonical.org/~kragen/strlen-utf8.html>.)

On IRC, [Aristotle Pagaltzis](http://plasmasturm.org) was pondering
how much performance variable-width encodings such as UTF-8 actually
cost, because it's commonly suggested that fixed-width encodings such
as ISO-8859-1 and UCS-4 are much faster.

He suggested:

> Huh, it just occurs to me that `strlen` is not at all expensive on
> UTF-8-encoded strings.  Not exactly as fast, but if you write it
> in asm, it only takes one extra instruction to count characters in
> UTF-8 vs those in an 8-bit encoding, per character. So, if you
> factor in cache misses, it should make no measurable
> difference. All you lose with a variable-width encoding is direct
> random access to arbitrary indices in the string, which is
> basically a non-use case.

It turned out that he was partly wrong, but mostly right.  And along
the way, we discovered that GCC's standard implementation of `strlen`
was quite pessimal.

I'm using Linux on a 700MHz Pentium III laptop with GCC 4.1.2, using
just the `-O` flag unless otherwise specified.

My First Assembly Version
-------------------------

First I thought about how to write `strlen`, and came up with this:

            .global my_strlen_s
    my_strlen_s:
            push %esi                
            cld
            mov 8(%esp), %esi
            xor %ecx, %ecx
            ## repnz lodsb doesn't work because lodsb doesn't update ZF
    loop:   lodsb
            test %al, %al
            loopnz loop
            mov %ecx, %eax
            not %eax
            pop %esi
            ret

> For those who aren't well-versed in 80386 assembly language:
> 
> - `lodsb` reads a byte from memory at `%esi`, puts it in `%al`, and
>   increments `%esi`;
> - `loopnz` decrements `%ecx` each time through the loop, and jumps
>   back to the label `loop:` if `%ecx` isn't zero and if the zero
>   flag `ZF` isn't set (that's the "nz" part);
> - `test %al, %al` sets the zero flag `ZF` if `%al` is zero (the C
>   string terminator), among other things.
> - after the loop body has run N times, `%ecx` is -N, so `not %eax`
>   converts the negative number -N from `%ecx` into a positive number
>   N-1.
> - you have to push `%esi` because it's a callee-saves register.  (To
>   my surprise.)
> 
> But there's a `scasb` instruction which I should have used instead
> of `lodsb; test`.

The inner loop there is three instructions.  

GCC's Assembly Version
----------------------

Then I looked at how GCC does `strlen`.  It turns out it inlines it,
even without any extra optimization flags.  Here's a sample call,
albeit with optimization:

     80484d5:	bf 00 99 04 08       	mov    $0x8049900,%edi
     80484da:	fc                   	cld    
     80484db:	b9 ff ff ff ff       	mov    $0xffffffff,%ecx
     80484e0:	b0 00                	mov    $0x0,%al
     80484e2:	f2 ae                	repnz scas %es:(%edi),%al
     80484e4:	f7 d1                	not    %ecx
     80484e6:	49                   	dec    %ecx
     80484e7:	89 4c 24 04          	mov    %ecx,0x4(%esp)

There the inner loop is just the `repnz scas`.  I'd forgotten about
`SCAS`.

A C Version
-----------

Here's a reasonable `strlen` in C:

    int my_strlen(char *s) {
      int i = 0;
      while (*s++) i++;
      return i;
    }

This compiles to the following:

    080483c4 <my_strlen>:
     80483c4:	55                   	push   %ebp
     80483c5:	89 e5                	mov    %esp,%ebp
     80483c7:	8b 55 08             	mov    0x8(%ebp),%edx
     80483ca:	b8 00 00 00 00       	mov    $0x0,%eax
     80483cf:	80 3a 00             	cmpb   $0x0,(%edx)
     80483d2:	74 0c                	je     80483e0 <my_strlen+0x1c>
     80483d4:	b8 00 00 00 00       	mov    $0x0,%eax
     80483d9:	40                   	inc    %eax
     80483da:	80 3c 10 00          	cmpb   $0x0,(%eax,%edx,1)
     80483de:	75 f9                	jne    80483d9 <my_strlen+0x15>
     80483e0:	5d                   	pop    %ebp
     80483e1:	c3                   	ret    

So here the inner loop is again three instructions: `inc %eax`; `cmpb
0, (%eax,%edx,1)`; `jne`.  It's been optimized down to `while (s[i])
i++;`.  The loop termination test is duplicated above the top of the
loop for the empty-string case.

My UTF-8 Assembly Version
-------------------------

So then I thought about how to do what Aristotle was suggesting.  In
UTF-8, bytes that start new characters begin either with binary 0 or
binary 11; the second and subsequent bytes of multibyte characters
have binary 10 as their high bits.  So to count the characters, you
just have to count the bytes that don't begin with binary 10.

I tried this:

            .global my_strlen_utf8_s
    my_strlen_utf8_s:
            push %esi
            cld
            mov 8(%esp), %esi
            xor %ecx, %ecx
    loop2:  lodsb
            test $0x80, %al
            jz ascii                # format 0xxx xxxx
            test $0x40, %al
            jz loop2                # format 10xx xxxx: doesn't start new char
    ascii:  test %al, %al
            loopnz loop2
            mov %ecx, %eax
            not %eax
            pop %esi
            ret

So here we jump back to `loop2`, rather than decrementing `%ecx` with
`loopnz`, in the case where the byte starts with 10.  And we can skip
the `test %al, %al` test, since 0000 0000 doesn't start with 10.

The inner loop of this version has 5 instructions, including two taken
conditional branches, in the usual ASCII case, and 7 instructions for
non-ASCII bytes, rather than 3 instructions per byte.  That's only two
extra instructions in the "usual" case, but if every instruction were
one cycle, that would still be a 67% increase in run-time.

Counting instructions and adding up their cycle count isn't a very
accurate way to measure performance in these superscalar days, though.

My UTF-8 C Version
------------------

A C version looks like this:

    int my_strlen_utf8_c(char *s) {
      int i = 0, j = 0;
      while (s[i]) {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
      }
      return j;
    }

GCC compiles it to this:

    080483e2 <my_strlen_utf8_c>:
     80483e2:	55                   	push   %ebp
     80483e3:	89 e5                	mov    %esp,%ebp
     80483e5:	8b 55 08             	mov    0x8(%ebp),%edx
     80483e8:	0f b6 02             	movzbl (%edx),%eax
     80483eb:	b9 00 00 00 00       	mov    $0x0,%ecx
     80483f0:	84 c0                	test   %al,%al
     80483f2:	74 23                	je     8048417 <my_strlen_utf8_c+0x35>
     80483f4:	b9 00 00 00 00       	mov    $0x0,%ecx
     80483f9:	0f be c0             	movsbl %al,%eax
     80483fc:	25 c0 00 00 00       	and    $0xc0,%eax
     8048401:	3d 80 00 00 00       	cmp    $0x80,%eax
     8048406:	0f 95 c0             	setne  %al
     8048409:	0f b6 c0             	movzbl %al,%eax
     804840c:	01 c1                	add    %eax,%ecx
     804840e:	0f b6 42 01          	movzbl 0x1(%edx),%eax
     8048412:	42                   	inc    %edx
     8048413:	84 c0                	test   %al,%al
     8048415:	75 e2                	jne    80483f9 <my_strlen_utf8_c+0x17>
     8048417:	89 c8                	mov    %ecx,%eax
     8048419:	5d                   	pop    %ebp
     804841a:	c3                   	ret    

An inner loop of 10 instructions --- but containing only a single
conditional jump, the `jne` at the bottom.  It uses the `and`; `cmp`;
`setne`; `movzbl` sequence to put either a 0 or a 1 into `%eax`,
depending on whether the byte fetched began with 10, and adds the
result into `%ecx` each time through the loop.

Aristotle's UTF-8 Assembly Version
----------------------------------

So after all this, I chatted with Aristotle some more, and it turned
out he had a much cleverer trick up his sleeve than I had thought ---
or, in fact, than he had thought.  He wrote:

> But wow, my code is *much* faster than any of the other variants.
> Unexpectedly.

Here's his version:

            .global ap_strlen_utf8_s
    ap_strlen_utf8_s:
            push %esi
            cld
            mov 8(%esp), %esi
            xor %ecx, %ecx
    loopa:  dec %ecx
    loopb:  lodsb
            shl $1, %al
            js loopa
            jc loopb
            jnz loopa
            mov %ecx, %eax
            not %eax
            pop %esi
            ret

In this case, the inner loop is 6 instructions, but as few as 3 of
them can execute.  I hadn't realized that you could get the top two
bits of a byte into the carry and sign flags with a single `shl`
instruction like that!  Aristotle explains:

> `Js` catches all bytes of the form x1xxxxxx. `Jc` catches 1xxxxxxx,
> but because `js` came first, that can only have been 10xxxxxx; and
> `jnz` then catches all 00xxxxxx other than all-0.  This runs about
> 3x as fast as your `my_strlen_s` --- most of the time, anyway.

Performance Results
-------------------

So how do these different approaches fare?  I wrote a program that
creates a 32MB string and timed the different functions on it, in
seconds, using wall-clock time.  Here are the results from one run,
sorted with `sort -t: -k1 -k3 -ns`.  The first few lines are various
functions' return values on the given strings.

    "": 0 0 0 0 0 0
    "hello, world": 12 12 12 12 12 12
    "naïve": 6 6 6 5 5 5
    "こんにちは": 15 15 15 5 5 5
    1: all 'a':
    1:                my_strlen(string) =   33554431: 0.227555
    1:         ap_strlen_utf8_s(string) =   33554431: 0.299494
    1:                   strlen(string) =   33554431: 0.314887
    1:         my_strlen_utf8_c(string) =   33554431: 0.380355
    1:              my_strlen_s(string) =   33554431: 0.432079
    1:         my_strlen_utf8_s(string) =   33554431: 0.525443
    2: all '\xe3':
    2:                my_strlen(string) =   33554431: 0.224037
    2:         ap_strlen_utf8_s(string) =   33554431: 0.299537
    2:                   strlen(string) =   33554431: 0.311552
    2:         my_strlen_utf8_c(string) =   33554431: 0.378162
    2:              my_strlen_s(string) =   33554431: 0.436755
    2:         my_strlen_utf8_s(string) =   33554431: 0.589165
    3: all '\x81':
    3:                my_strlen(string) =   33554431: 0.225011
    3:         ap_strlen_utf8_s(string) =          0: 0.313525
    3:                   strlen(string) =   33554431: 0.316182
    3:         my_strlen_utf8_s(string) =          0: 0.322959
    3:         my_strlen_utf8_c(string) =          0: 0.390958
    3:              my_strlen_s(string) =   33554431: 0.432342

The 33554431 and 0 numbers are the return values; this ensures that
GCC doesn't optimize out the `strlen` call.

So, on my CPU, the C version of `strlen` took about 28% less time than
the built-in inlined one for this long string; it only uses two
registers instead of the three used by the built-in inlined one (the
one that uses `repnz scasb`); and they both seem to be about 12 bytes.
I don't know why GCC inlines the worse one. Most likely it used to be
faster than whatever GCC generated at the time and hasn't been
revisited.

It's worth noting that while my C version of `strlen` was always
faster than the built-in version, Aristotle's UTF-8 version was always
in between.

On Aristotle's Core 2 Duo 1.8GHz (also with GCC 4.1.2 and `-O`), the
difference was very much greater.  Here are his results:

    "": 0 0 0 0 0 0
    "hello, world": 12 12 12 12 12 12
    "naïve": 6 6 6 5 5 5
    "こんにちは": 15 15 15 5 5 5
    1: all 'a':
    1:                my_strlen(string) =   33554431: 0.025906
    1:         ap_strlen_utf8_s(string) =   33554431: 0.039629
    1:         my_strlen_utf8_c(string) =   33554431: 0.096041
    1:                   strlen(string) =   33554431: 0.114821
    1:              my_strlen_s(string) =   33554431: 0.116529
    1:         my_strlen_utf8_s(string) =   33554431: 0.132648
    2: all '\xe3':
    2:                my_strlen(string) =   33554431: 0.025912
    2:         ap_strlen_utf8_s(string) =   33554431: 0.039583
    2:         my_strlen_utf8_c(string) =   33554431: 0.095699
    2:                   strlen(string) =   33554431: 0.114452
    2:              my_strlen_s(string) =   33554431: 0.114622
    2:         my_strlen_utf8_s(string) =   33554431: 0.136109
    3: all '\x81':
    3:                my_strlen(string) =   33554431: 0.026112
    3:         my_strlen_utf8_s(string) =          0: 0.039656
    3:         ap_strlen_utf8_s(string) =          0: 0.039661
    3:         my_strlen_utf8_c(string) =          0: 0.096416
    3:              my_strlen_s(string) =   33554431: 0.115327
    3:                   strlen(string) =   33554431: 0.116629

All of this code is online in two files:

* [mystrlen.c](http://pobox.com/~kragen/sw/mystrlen.c)
* [mystrlen.s](http://pobox.com/~kragen/sw/mystrlen.s)

Conclusions
-----------

1. GCC is better at writing x86 assembly than I am.  No surprise
   there.  Even when its inner loop is 10 instructions, it beats my
   three-instruction inner loops for speed.

2. Aristotle is better at writing x86 assembly than GCC is.

3. Aristotle was essentially correct: the penalty for counting UTF-8
   characters, or indexing into or iterating over the characters of a
   UTF-8 string, is very small.

4. However, there is a speed penalty.  Although GCC's built-in
   `strlen` is much slower than Aristotle's function, a
   straightforward byte-counting C `strlen` compiled with optimization
   is faster still.

5. GCC should change to use the straightforward byte-counting C
   `strlen` instead of what it currently inlines.  The version of
   strlen that GCC inlines is worse than the one it compiled from C in
   every way: it's more instructions, more bytes of machine code, four
   times slower, and uses more registers (one of which is a
   callee-saves register!).

6. People probably shouldn't worry about the efficiency of counting
   and iterating over characters in UTF-8 strings, at least not if
   they were using null-terminated C strings before.

(I wrote this 2008-04-19, when Buenos Aires was still under a blanket
of heavy smoke.)

I went out in the smoke tonight, Saturday night, to try to get food
from Chinatown, despite Beatrice's protestations that 20:00 was too
late.

As I slowly walked the few blocks to the route 107 bus stop, three 107
buses passed me.  I waited at the bus stop as two 107 buses passed
going the other way; while I waited, standing in the street, other
would-be passengers accumulated: a bald man with gray hair cuddling
and kissing with his middle-aged girlfriend as they stood in the
street behind me, and two teenagers.

Eventually I gave up on the bus and hailed a passing taxi, which I
took to a Citibank near Chinatown, where I extracted money from my
bank account via an ATM.

The sidewalk cafes in the commercial district were full of people,
despite the smoke blanketing the city; I recognized an acquaintance
waitressing at the the restaurant "1810", where we first tasted
Argentine empanadas.  A few blocks away, as I walked in the direction
of the 107 bus route and Chinatown, I found a long line of mostly old
people.  I asked a young man standing in line what the line was for.
He didn't answer for a moment, and then without meeting my eyes, he
explained that it was for bread.

I walked along what I thought was the 107 bus route, but I arrived in
Chinatown before seeing any more 107 buses.  The store I had hoped to
go to had closed at 20:30; I walked around looking for an open store,
so I could buy peanut butter, ginger root, and packaged ramen.  (Ramen
only costs $2 a package there.)

I passed a couple of young men with small shopping carts full to the
brim of 1.5-liter Quilmes beer bottles, waiting to be let into an
apartment complex; elsewhere I passed one or another sentry waiting at
a door, presumably to let in people who had gone out.

After walking about six blocks through almost all of Chinatown, I
never found an open grocery store, so I went to Todos Contentos and
ordered a couple of dishes to take home to Beatrice.

As I waited, I read some of the sports section of the paper.  It had a
list of the rugby and football games that had been canceled because of
the smoke, although it explained that the air "wasn't toxic", just
irritating and allergenic.  Maybe "tóxico" means something different
in Spanish than in English.

As I carried my order from the restaurant to the 107 bus stop, I
stopped by "Dashi", a sushi restaurant near the Buddha Bar.  The
newspaper blurbs outside the door explained that the chef had spent a
long time in Perú and had studied in California, so I hoped that
perhaps they might have some of the sushi flavors I've been missing
here in Argentina: maguro, uni, natto, unagi, ama-ebi, inari, and so
on.  I, went in to read the menu.  Although it had several pages
listing an impressive number of different kinds of sushi, more careful
reading revealed that they were made from a small number of basic
ingredients that did not include any of the above.  I was a little
disappointed but not surprised.

I walked on.  A couple sitting on some steps asked me what my mask was
for --- I explained it was for the smoke.  Wordlessly the man grinned
and lifted his cigarette to his lips and took a long drag, filling his
lungs with much denser smoke.  I laughed.

I eventually caught the 107 home.  Strangely, when I got on, the bus
was empty.

I wrote this on 2008-04-20, but hadn't gotten around to sending it out
until now.  It represents an approach to hardware architecture that's
basically fallen by the wayside, but it's interesting to think about,
even if nobody's found a way to make it practical yet.

Like everything else posted to kragen-hacks without any notice to the
contrary, this program is in the public domain; I abandon any
copyright in it.

#!/usr/bin/python
"""
Simulate a "dynamic dataflow" machine.

My understanding of this comes from
<http://thesz.mskhug.ru/browser/hiersort/doc/sdd2.pdf?format=raw>,
"Dynamic Dataflow Architecture with Token Sorting" and [some lecture
slides from Slovenia](http://csd.ijs.si/courses/dataflow/index.htm
Dataflow Architectures, Jurij Silc).  (Although I read some of the
background afterwards, including Ellen Spertus's thesis and a little
of the work of the other students under Arvind in the 1980s, and it
seems to have been more or less right.)

The basic idea, as I understand it, is as follows.  Every computation
has two arguments, and produces zero or more "message sends".  The
message sends consist of a tag, a destination port, and a datum.  Memory
consists of a bunch of stored message sends.  At each step, memory is
searched for two stored messages with the same tag.  Their data become
the two arguments to a new computation, which also has access to the
tag, and they are removed from memory.

To the memory, the tag can be just an opaque bitstring, but the
execution unit uses the tag to figure out what code to run in the new
computation.  But maybe only some of the tag bits matter for selecting
the code; the other tag bits are available to the computation, though.

The idea is that you can run a lot of these computation steps in
parallel, because you can fire off a new computation any time both of
its arguments become available.  The idea is *not* that this is some
kind of improvement in expressiveness or clarity of your program code.

Zefirov, Stepanov, and Klimov's paper gives this example matrix
multiply state machine:

    group G {i,j,k} { // indices i,j, and k represent the tag
                       // and are implicit inside group.
        node M (x,y) {
            send (x*y) to S.y;
        }
        node S (x,y) {
            if (k<N)
                 send (x+y) to S.x {i,j,k+1};
            else
                 send (x+y) to (array C receiver) {i,j};
        }
    }

They write:

> To start this program we ought to send zeroes to S.x {1..N,1..N,1},
> A[i][k] to M.x{i,1..N,k} and B[k][j] to M.y{1..N,k,j}. This is a
> task in itself, it requires O(N^3) sends. We will discuss it further
> below.

In the above, the capital letters and the indices in {} are part of
the tag, while the .x or .y is the "destination port".

How should that look in Python?  You could imagine phrasing

    send (x+y) to S.x {i,j,k+1};

as

    S[i, j, k+1].x = x + y

Then you could rewrite the above as

    node M {i,j,k} (x,y): S[i, j, k].y = x * y
    node S {i,j,k} (x,y):
        if k < N: S[i, j, k+1].x = x+y
        else: C[i, j] = x+y

But the "node" lines are a problem.

There's no way in Python to write a new kind of subroutine, nor to
write a normal lambda, nor to implicitly bind a name within a scope.
But you could write this:

    class M(node):
        def run((i, j, k), x, y): S[i, j, k].y = x * y
    class S(node):
        def run((i, j, k), x, y):
            if k < N: S[i, j, k+1].x = x+y
            else: C[i, j] = x+y

Or, with decorator syntax, this:

    @node
    def M((i, j, k), x, y): S[i, j, k].y = x * y
    @node
    def S((i, j, k), x, y):
        if k < N: S[i, j, k+1].x = x+y
        else: C[i, j] = x+y

So how can we make that work?

"""

class Machine:
    def __init__(self):
        self.memory = {}
        self.tasks = []
    def run(self):
        (node, indices), args = self.tasks.pop()
        self.message(("running", node, indices, args))
        node.run(indices, args)
    def retrieve(self, tag):
        "Destructive retrieve by tag."
        rv = self.memory[tag]
        del self.memory[tag]
        return rv
    def send(self, tag, argname, value):
        try: old_argname, old_value = self.retrieve(tag)
        except KeyError: self.memory[tag] = argname, value
        else:
            assert argname != old_argname
            self.postpone(tag, {argname: value, old_argname: old_value})
    def postpone(self, tag, data): self.tasks.append((tag, data))
    def message(self, message): print message

class Node:
    def __init__(self, machine, function, argnames, name = '(anon)'):
        self.machine = machine
        self.function = function
        assert len(argnames) == 2, argnames
        self.argnames = argnames
        self.name = name
    def __repr__(self): return '<node %s>' % self.name
    def __getitem__(self, indices): return NodeReceiver(self, indices)
    def assign(self, indices, argname, value):
        assert argname in self.argnames, (argname, self.argnames)
        self.machine.send((self, indices), argname, value)
    def run(self, indices, args):
        arglist = [indices] + [args[argname] for argname in self.argnames]
        self.function(*arglist)

class NodeReceiver:
    def __init__(self, node, indices):
        self.__dict__['node'], self.__dict__['indices'] = node, indices
    def __setattr__(self, attr, value):
        self.node.assign(self.indices, attr, value)

machine = Machine()

def magically_get_argument_names(function):
    code = function.func_code
    return code.co_varnames[:code.co_argcount]

def node(function):
    "A decorator."
    return Node(machine, function, magically_get_argument_names(function)[1:],
                name = function.func_name)

def test():
    "Regression test."

    @node
    def M((i, j, k), x, y): S[i, j, k].g = x * y

    @node
    def S((i, j, k), f, g):
        if k < N: S[i, j, k+1].f = f+g
        else: C[i, j] = f+g

    C = {}
    N = 4
    S[3, 4, 4].f = 3
    S[3, 4, 4].g = 4
    machine.run()
    assert machine.memory == {}
    assert machine.tasks == []
    assert C == {(3, 4): 7}

    S[1, 2, 4].f = 1
    M[1, 2, 4].x = 3
    M[1, 2, 4].y = 4
    machine.run()
    machine.run()
    assert C[1, 2] == 13

if __name__ == '__main__': test()


Tue, 13 May 2008

This message was sent to kragen-tol at canonical.org> and <kragen-hacks at canonical.org
and should fail to get through.

ah
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.canonical.org/pipermail/kragen-discuss/attachments/20080513/dce96279/attachment.htm