Mon, 31 Oct 2005

http://en.wikipedia.org/wiki/Escape_velocity says:
    On the surface of the Earth the escape velocity is about 11.2
    kilometres per second.

You have: 100 kg * (11.2 km/sec) * (11.2 km/sec) / 2
You want: kilowatt hours
        * 1742.2222
        / 0.00057397959

So 1700 kWh per (large) person, to lift them out of Earth's gravity
well (assuming perfect efficiency, as with a space elevator.)

http://www.ecoworld.org/energy/EcoWorld_Energy_Resid_KWH_Prices.cfm
lists average US residential electricity prices from 6.5 to 14.8 cents
per kWh, with an outlier at 33.3 in San Francisco during the
California energy crisis.  It also claims that the cost of the fuel
alone amounts to about 0.5 to 1 cent per kWh.

So if we have to pay 10 cents per kWh, lifting a person into space
should cost around $170 --- an energy cost that could in theory be
recovered if they came back down.  (At present this energy is mostly
dissipated thermally.)

Evacuating the entire human race to an extraterrestrial habitat
prepared to handle them should then have an energy cost around $1
trillion.  This is roughly 2% of annual world GDP ($55.9 trillion) at
PPP.  (See http://www.worldbank.org/data/databytopic/GDP_PPP.pdf for
details.)

Current world energy usage is around 354 exajoules
(http://energy.er.usgs.gov/products/Papers/WMC/17/) or 400 exajoules
(http://www.pugwash.org/reports/pac/agra/agra_reports_wg4.htm) or 375
exajoules (http://www.globalfuture.com/0002.htm) or thereabouts.  1700
kWh per person is

You have: 1742 kilowatt hours * 6 billion
You want: joules
        * 3.76272e+19
        / 2.6576519e-20

38 exajoules, a significant fraction of world yearly energy usage, but
far from unimaginable.  It would probably be enough to raise the unit
price of energy.

I conclude that, while technical obstacles currently make evacuation
of Earth's population to extraterrestrial colonies impossible, the
energy cost of the evacuation itself is within reason.

Sat, 29 Oct 2005

#!/usr/bin/python
"""Natural-order sorting, supporting embedded numbers.

foo9bar2 < foo10bar2 < foo10bar10

"""
import random, re, sys

def natsort_key(item):
    chunks = re.split('(\d+(?:\.\d+)?)', item)
    for ii in range(len(chunks)):
        if chunks[ii] and chunks[ii][0] in '0123456789':
            if '.' in chunks[ii]: numtype = float
            else: numtype = int
            # wrap in tuple with '0' to explicitly specify numbers come first
            chunks[ii] = (0, numtype(chunks[ii]))
        else:
            chunks[ii] = (1, chunks[ii])
    return (chunks, item)

def natsort(seq):
    "Sort a sequence of text strings in a reasonable order."
    alist = [item for item in seq]
    alist.sort(key=natsort_key)
    return alist

def ok(a, b): assert a == b, (a, b)
def test():
    ok(natsort('foo10bar10 foo9bar2 foo10bar2'.split()),
       'foo9bar2 foo10bar2 foo10bar10'.split())

    nums = map(str, range(100))
    random.shuffle(nums)
    ok(natsort(nums), map(str, range(100)))
    assert nums != map(str, range(100))  # didn't get mutated...

    # compare lexically when numerical value identical
    ok(natsort('b1 a2 a1 b01 b2'.split()), 'a1 a2 b01 b1 b2'.split())

    # A lazy implementor might support only short numbers; this prevents that:
    ok(natsort(['a100000000000000000000000.txt',
                'a99999999999999999999999.txt']),
       ['a99999999999999999999999.txt', 'a100000000000000000000000.txt'])

    # Don't interpret 033 as octal (decimal 27).
    ok(natsort(['033', '32']), ['32', '033'])
    
    # This one was hard and possibly not worth it: use floating point
    # to handle non-integer numbers.  The cost is that sequences of
    # integers separated by dots will lose precision in some cases,
    # and also may sort funny.
    # e.g. 1234.56789012345678901234567890.34567890.
    ok(natsort(['2.49', '2.5', '2.6']), ['2.49', '2.5', '2.6'])

    # Make sure we can still handle sequences of integers with dots.
    ok(natsort(['1.2.3', '1.3.4']), ['1.2.3', '1.3.4'])

    # There are cases where we lose on floating-point precision,
    # although usually then the fallback to pure lexical comparison
    # saves us:
    ok(natsort(['1.3000000000000000000000000000000000000000000000000000001',
                '1.3000000000000000000000000000000000000000000000000000000']),
       ['1.3000000000000000000000000000000000000000000000000000000',
        '1.3000000000000000000000000000000000000000000000000000001'])

    ### Here are some cases where we get suboptimal answers:
    # Hexadecimal numbers get mangled.
    ok(natsort(['0x99', '0x9a']), ['0x9a', '0x99'])
    # In cases where you expect asciibetical order, you will be
    # disappointed.
    ok(natsort([',', '5', ':']), ['5', ',', ':'])
    # Fallback to pure lexical comparison doesn't always save us on
    # floating-point precision:
    ok(natsort(['01.3000000000000000000000000000000000000000000000000000001',
                '1.3000000000000000000000000000000000000000000000000000000']),
       ['01.3000000000000000000000000000000000000000000000000000001',
        '1.3000000000000000000000000000000000000000000000000000000'])

test()

if __name__ == '__main__':
    sys.stdout.writelines(natsort(sys.stdin))

Thu, 27 Oct 2005

You could build a pretty simple hexapod with just six cables to move a
hanging flutterwumper in six degrees of freedom, and you could probably
control the lengths of the cables with great precision; but by itself
that doesn't give you much accuracy in the finished product.

Closed-loop control, though, could give you the accuracy you need.  If
you can measure the position of the flutterwumper to great precision,
you can correct the position until it works; similarly, to correct for
cutting-head loading errors, cutting-head wear, etc., if you can
measure the depth of cuts in the material, you can correct for
whatever inaccuracies arise.

Ultrasound seems like a plausible approach at first, but for
manufacturing metal parts, you need shape accuracy (thus measurement
accuracy) on the order of 25 microns.  At sea level, that's 75
nanoseconds of timing accuracy on the sound, which means you need a
13.3 MHz sound.  Air cannot carry sounds close to that frequency.

Radio seems like a plausible approach at first too, but 25 microns is
85 * 10^-15 seconds --- 85 attoseconds.  There are no circuits
contemplated that can measure the round-trip time of radio pulses that
accurately.

So light interferometry, of one form or another, is the only feedback
mechanism I can think of that might work.  Laser phase shifts can
measure the distance of movement of an object; a circularly polarized
laser can tell which direction, too.  Using this principle you can
measure the shape of a macroscopic continuous surface.  I still don't
quite know how to measure its position to within a few microns, but
maybe I don't need to.

Suppose you wanted to measure nearby vibrations with LEDs and
a multi-megapixel camera.  The LEDs and the camera are mounted some
distance apart facing one another; the camera measures the direction to
each LED within a small fraction of a pixel.  So perhaps you could use
cheaper cameras for CNC feedback.

"Cheap" (around US$100-$300) consumer CCD cameras, such as the Minolta
DImage G500, commonly have 100-degree fields of view and roughly
2000x2500 square pixels, with 2500 pixels across the 100 degrees
--- so about 1/25 of a degree per pixel, or about 700 microradians.
That means that a pixel subtends 25 microns at a range of about 3.6cm,
and a hundredth of a pixel subtends 25 microns at a range of about 3.6m.
Optical zoom extends this range; 3x optical zoom I believe reduces the
linear field of view by a factor of 3, to about 33 degrees in this case,
so would increase the range to about 10cm at one pixel per 25 microns.

You could mount the camera either on the movable platform, with reference
LEDs on a fixed point (or possibly just referencing from the workpiece,
if the workpiece is not the fixed point), or on a fixed point looking
at LEDs on the movable platform.  In either case, two cameras looking
in different directions should give dramatically better results, since
the precision of measured depth will be much worse than the transverse
motion precision.

(Maybe purely mechanical feedback would work too, like monitoring
stepper motor voltage/current waveforms.)

The standard hexapod design makes the six legs out of precision ballscrews
which change length; Hexel has a clever design called the "Rotobot"
where the legs don't change length; instead, the points where they are
attached to a fixed object can be moved independently all the way around
a circle.

Rather than using ballscrews, it seems that a compression spring (for
example, a piece of bamboo or PVC pipe bent into a semicircle) with a
braided nylon rope would be cheaper and might suffice for many purposes.

Mon, 24 Oct 2005

Passive sonar systems have a practical difficulty: you must place many
microphones far apart and run wires to all of them (or have nearby
computers to transmit their signal over e.g. radio.)

An alternative is to use laser vibration-detection systems, such as
those used for remote snooping on speech.  These use Doppler shifts in
laser light reflected from an object (such as a window or wall) to
observe changes in the distance to an object, changes much smaller
than a light wavelength.

A set of such lasers could put many "virtual microphones" hundreds of
meters apart, with a consequent dramatic improvement in spatial
resolution, without having to distribute equipment over a large area.
I suspect that even a single laser beam scanning, say, a wall, could
improve spatial resolution significantly, analogously to synthetic
aperture radar.

Some possible uses; maybe not all of these are plausible:
- scanning three-dimensional human body shapes by their sound
  reflections, for recognition or to find concealed weapons;
- scanning environments on the other side of a wall, by the way they
  shape sound wavefronts that impinge upon that wall;
- measuring macroscopic distances and shapes such as buildings, caves,
  or terrain, for instance for building planning or inspection;
- spatially localizing sound sources in order to filter out background
  noise, for example when recording music;
- identifying, for example, vehicles by sound (having filtered out
  background noise) and measuring their position and velocity;
- characterizing sound wavefronts some distance away in order to have
  time to precompute a canceling waveform for active cancellation;
- wireless communication with people without requiring the people to
  possess any particular device; think of the user interface to the 
  Star Trek ship computer.

The spatial resolution of a passive sonar system is limited by the
wavelength of the sound.  Supposedly cats and bats can hear sound
frequencies of 75kHz-100kHz
(http://online.cctt.org/physicslab/content/phy1/lessonnotes/Sound/lessonsound.asp)
which means air must be able to carry such sounds.
http://www.massa.com/fundamentals.htm claims that attenuation grows
linearly from about 0.5 dB/foot at 50kHz to nearly 5 dB/foot at 250kHz
--- which would seem to limit 250kHz sounds to less than 25 feet, and
50kHz sounds to less than 250 feet.  250kHz is around 1.3mm, 50kHz is
6.6mm, and 10kHz is 3.3cm, so e.g. face recognition depends strongly
on ultrasonics and close range.

Really good spatial resolution probably requires the use of light
rather than sound.

Sat, 22 Oct 2005

#!/usr/bin/python
# Compute "lexical signatures" for each paragraph in a text document

# The idea is to compress these lossily into "queer numbers" to be
# used as fragment IDs.  See my "queer numbers" post on kragen-tol.

# This Python program depends on generator expressions, which were
# introduced in Python 2.4.  You can wrap the generator expressions in
# [] to turn them into list comprehensions to make them run in older
# Pythons.

import re, sys

def paragraphs(afile):
    "Iterate over the paragraphs in a text file."
    current_paragraph = ''
    for line in afile:
        if line.strip() == '':
            if current_paragraph != '': yield current_paragraph
            current_paragraph = ''
        else:
            current_paragraph += line
    if current_paragraph != '': yield current_paragraph

def words(astring):
    "Iterate over the words in a string, such as a paragraph."
    for word in re.findall("['\w]+", astring): yield word.lower()

def freq(aseq):
    "Compute a frequency table of each unique element in a sequence."
    rv = {}
    for item in aseq:
        rv.setdefault(item, 0)
        rv[item] += 1
    return rv

def mergefreqs(freqs):
    "Given a sequence of frequency tables, compute a summary table."
    rv = {}
    for freq in freqs:
        for key, value in freq.iteritems():
            rv.setdefault(key, 0)
            rv[key] += value
    return rv

def nonuniques(freqs):
    "Cull a frequency table to remove things that occur only once."
    return dict((k, v) for k, v in freqs.iteritems() if v > 1)

def count(freq):
    "Given a frequency table, return the length of the sequence it represents."
    return float(sum(freq.itervalues()))

def tfidf(freqs):
    """Compute TF/IDF over a sequence of frequency tables.

    I'm not sure this is actually TF/IDF, because I'm not sure I know
    what TF/IDF is.
    
    """
    freqs = list(freqs)
    totals = mergefreqs(freqs)
    total_terms = count(totals)
    for freq in freqs:
        rv = {}
        terms = count(freq)
        for word, n in freq.iteritems():
            tf = n / terms
            idf = totals[word] / total_terms
            rv[word] = tf/idf
        yield rv

def sigs(freqs):
    """Compute 'lexical signatures' in the form of ordered word lists,
    given a sequence of frequency tables to start with.

    """
    for thingy in tfidf(freqs):
        words = thingy.keys()
        words.sort(key = thingy.get, reverse = True)
        yield words

def display_sigs(infile):
    """Print a text summary of lexical signatures of some file."""
    freqs = list(freq(words(para)) for para in paragraphs(infile))
    hmm = nonuniques(mergefreqs(freqs))
    ii = sigs(freqs)
    max = 10
    for line in ii:
        print ' '.join(line[:max])
        print '->', ' '.join([word for word in line
                              if hmm.has_key(word)][:max])

if __name__ == '__main__': display_sigs(file(sys.argv[1]))

Thu, 20 Oct 2005

I downloaded Walter Parquhar Hook's 1842 Church Dictionary
<http://www.archive.org/details/ChurchDictionary> from the Internet
Archive and tried OCRing some text from it, using free software.  I
didn't have a lot of success, but success looks tantalizingly close.

I used DjView to extract the first page that has actual text on it.

gOCR
----

gOCR renders the first four lines of the sample book, as output by
DjView, more or less as follows:

    __E stronges_ __ecommendation o_ the _olIo_-
    @g _orb cons at ts @@ the statemen_ {_f @s be at gg
    _oR the __ost paRt, me,Rely a Comp at at@n; a__d tb@
    _eneraI ac___o_ledgment rendel_s @ unnecessarY

It actually reads:

    THE strongest recommendation of the follow-
    ing Work consists in the statement of its being,
    for the most part, merely a Compilation; and this
    general acknowledgment renders it unnecessary

A second try, using the command-line "gocr -C 
'- abcdefghijklmnopqrstuvwxyz,;ABCDEFGHIJKLMNOPQRSTUVWXYZ.'
ChurchDictionary0004.pbm" yielded, after 100 seconds of CPU time, the
following results:

    __E stronges_ __ecommendation o_ the _olIo__
    \code(011d)ng _orb consìsts ìn the statemen_ i_f ìts beîngg
    _oR the __ost paRt, me,Rely a Compìlatìon; a__d tbìs
    _eneraI ac___o_ledgment rendel_s ít unnecessarY

Ocrad
-----

(I don't remember what version of Ocrad this was --- probably 0.12, but
definitely not 0.13.)

Ocrad took only 19 seconds and produced the following results:

    rHE strongest _.ecommen_ation of _he fo_lo__
    ing Work consists iA the staLe_ent l_f its being,
    for the _oost yart, _oe,rely a Co__iI_tion; al_d _his
    gen_ral achl_o_ledg_oent rende__s ié unnecessary

Upon being told that it was trying to recognize ASCII (-c ascii), it
produced:

    rHE strongest _.ecommen_ation of _he fo_lo__
    ing Work consists iA the staLe_ent l_f its being,
    for the _oost yart, _oe,rely a Co__iI_tion; al_d _his
    gen_ral achl_o_ledg_oent rende__s it unnecessary

That's nearly good enough to be corrected with a dictionary.  

gOCR did better on the second "the", "in", "statement", "part",
"merely", "compilation", "and", "this", and "acknowledgment" --- 9 of
the 29 words, while Ocrad did better on the first "the", "strongest",
"of", "Work", "consists", "its", "being,", "for", "the", "this", and
"unnecessary", 12 of the 29.

ClaraOCR
--------

It took me a long time to find ClaraOCR, because the web site has been
stolen.  I couldn't figure out how to get ClaraOCR to produce OCR
output at first; the answer is to train it, iterating the training
until you get an acceptable output.  This is a slow process, and
doesn't produce good results even after considerable effort (the
recognizer is kind of dumb, and there's apparently no way to correct
the segmentation into symbols), but it seems that it can ultimately
produce better results than the alternative methods.

I was eventually able to train it to get the first few lines mostly correct:

     PRE[0]FACE.
     T HE strongest r ecommendation of the follow-
     ing Work consists in t he statement of its being,
     for the most part, merely a Compi, an[104]d th
     lation
    .
     is
     g eneral acknowledgment renders it unnecessary

The next four lines looked like this:

     [216]o meri[227]ro[204] [229]he [231]a[219]io[201]is so[207][208]ce[210][194] ro[214] [215]h
     i[202]h it as
     ee[245] [258]o[246] r ed [262][269]ra[264]ts ia[235][265]. [266]e[253] o teii [240]ade
     almost [293]or o[296] [297]or roiri so[305]e of o[281]i[283] eates[278]
     r[320]irie[331][318], a[341]d t e [352]o[354] i er as ee[322] sorrre[350]i[336][319][351][337] ce[325]  

They actually read as follows:

    to mention the various sources from which it has
    been compiled.  Extracts have been often made
    almost word for word from some of our greatest
    Divines, and the compiler has been sometimes cen-

Slightly better segmentation and a dictionary would help considerably
here.  "meri.ro." would probably be "men.io." with better
segmentation, and /usr/share/dict/words has only one possibility for
that; likewise ".a.io.is" occurs only as "raviolis", and would
probably be ".a.ious" with better segmentation --- yielding
possibilities "carious", "sanious", and "various".  My frequency
analysis of the British National Corpus
<http://pobox.com/~kragen/sw/wordlist> found "various" 15503 times,
"carious" 7 times, and "sanious" and "raviolis" less than 5 times, so
it should be pretty easy to pick the right one.

This is after 140 cycles of recognition, which took nearly an hour.

ClaraOCR also includes some code for cooperative web-based OCR, but I
haven't tried it yet.

ClaraOCR has a user interface that makes its OCR process dramatically
more transparent than gOCR's and Ocrad's, so I feel better about it
than the above miserable performance would lead one to expect.

ocre
----

ocre wouldn't work on the ASCII PBM, claiming it wasn't a PBM, only
the binary PBM; eventually it popped up a bunch of windows to get help
with letters it was having trouble with.  It seemed to be having
trouble with a lot of letters, so I eventually gave up.  

DjVu
----

DjVu isn't in the same category as ocre, ClaraOCR, gOCR, and Ocrad,
but it's important.  The DjVuLibre tools provide powerful
free-software compression algorithms for handling page images,
particularly bilevel page images, and DjView is a far better document
reader than xpdf, ghostview, gv, or xdvi, because it's usually
instantly responsive, supports copy and paste, and supports text
search.

It isn't presently the case that DjVu has a good free encoder for
scanned data.  Presumably that will eventually change, after it
becomes important.

In the meantime, DjVu will be important for several reasons:
- the Internet Archive is releasing a large volume of public-domain
  page images as part of their Million Books Project in DjVu format,
  but presently have no OCR data for them.
- DjVu files can take advantage of OCR output for copying and
  full-text search.  The free-software "djvused" command can add OCR
  output to existing DjVu files without re-encoding the images.
- DjVu files may be a useful format for acquiring training and
  evaluation data for OCR programs, because unlike every widely-used
  file format, they contain both the raw (possibly compressed) page
  images, and "ground truth" OCR output.  Consequently the adoption of
  DjVu will make OCR training and evaluation data much easier to come
  by than in the past.

Directions for improvement
--------------------------

Better UI: ClaraOCR has spent the largest amount of work on its user
interface, but despite all that, it's far from obvious how to get any
output at all, and entering the correct transliteration for a single
five-letter word requires five mouse clicks or five arrow-key presses;
and classifying a "symbol" that its segmenter has discovered as
"noise" requires many more mouse clicks.

Most of the improvements suggested below could also be used to
dramatically speed up the training process.

Dictionaries: all 26 of the distinct words in the text segment I
tested with occur in my word list mentioned above; the least frequent
are "ing", with 123 occurrences, and "acknowledgment" with 96 (the
modern spelling "acknowledgement" has 554.)  (The total of all
frequencies therein is 90080933, a little over 90 million, but that
doesn't include the words that occurred fewer than five times.)  This
suggests that a little bit of fixing up based on language-specific
frequencies could help a lot.

Combination: it is at least possible that, for example, gOCR and Ocrad
together could produce better results than either alone, since each
excelled the other on certain words.

Ambiguous output: if the text output is to be used to reformat the
scanned text (for example, for columns of a different width, for a
device with less storage space, or for a text reader) it is obviously
essential to choose the single most likely reading of the text; but if
OCR is being used merely to make a set of images searchable,
"f?ro(m|iri)" is a perfectly good transliteration of "from", which
ClaraOCR rendered above as "roiri".

Better algorithms: the text I was OCRing is eminently readable to
human eyes, despite being slightly askew and printed with somewhat
worn and dented type.  It's absurd that gOCR could only correctly
recognize "a" and a couple of instances of "the" in the original text;
that Ocrad got only 11 of the 29 words correct, even without a
dictionary; and that after an hour of training of ClaraOCR, it had a
similar success rate to (untrained!) gOCR on the next four lines.

Better evaluation data: a standard OCR corpus for evaluating and
training software would probably help a lot.  There's the ISRI OCRtk
OCR Performance Toolkit, which contains "a large and diverse corpus of
280 scanned page images with corresponding ground-truth text," but
it's not clear whether it's free software.

Mon, 17 Oct 2005

A "tree view" is a way of exploring tree-structured data; at any given
time, a subset of the tree is displayed, and clicking on any node to
display its children or hides all its descendants.  An editable tree
view, where you can easily change the contents of any node or copy or
move it elsewhere in the tree, is often called an "outliner".

Some people are big fans of outliners as universal user interfaces,
notably Douglas Engelbart and Dave Winer.  They have the advantage
that the display at any given time can include data from many
different sources, but still remains essentially linear.

I wrote recently about the art of summarization or rubrication as an
aid to navigation and retrieval.  These days, when I want to find a
web page on some topic, I might perform a Google query; this yields
ten web page titles, which I generally read in sequence, and when I
find one that sounds promising, I read the extract.  If it sounds
sufficiently interesting, I might follow through to the page itself,
which may have its own table of contents summarizing different
sections of the document.

This sounds a lot like outlining, and you could imagine a web browser
that would display the whole process as an outline, dynamically
transcluding text beneath each link you click on.  This may or may not
be an improvement on the standard web browser UI; probably for some
tasks it would be, especially if the web pages in question were
designed for it.

Today I was watching Dick Hardt's Identity 2.0 presentation from OSCON
<http://www.identity20.com/media/OSCON2005/>.  In Larry Lessig style,
his slides changed more than once per second, and typically had only a
word or two, or a simple diagram, in each slide.  I used mplayer's "]"
key to speed the presentation up to 6x real time, and in consequence I
zipped through the presentation summary pretty quick; when something
looked interesting, I would use the "[" key to slow back down to to
normal speed, and the "<-" key to zip back to what I'd just skipped
over.

It seems to me that this kind of interaction technique was closely
analogous to the operation of an outliner, and could likewise be
applied to navigating hypertext.  If "linked-to" text is temporally
compressed (by some standard ratio) after the "link text", you could
skip over it by speeding up.  My analysis of typical summarization
suggests that if the standard ratio were a factor of 60 (six octaves),
the compressed text would usually take about the same amount of time
as its summary; six octaves is enough that it could perhaps run
concurrent with the "link text" without interfering with its
comprehensibility.

Even an infinitely-deep link structure could fit in some finite time
by this method, even without concurrency, because the total time is
the sum of an exponential series.  

More refinements:

Adjusting the speed should pop you back some fixed amount of
subjective time, since presumably there's some time lag.

Without concurrency, there should probably be a relatively nearby
floor on the speed of any particular text, perhaps an octave or two
below 'normal speed'; otherwise you might find yourself in a very long
silence.

Without any extra help, speech sounds fine at 1.75x or so (about 3/4
octave up); apparently by squeezing out pauses and using a little
signal-processing magic, people can quickly learn to understand speech
at 2x.  This could be very helpful.

Obviously this all applies to video as well as audio.

To actually play this, at some point you have to cheat and limit
yourself to some finite sample rate.

Sat, 15 Oct 2005

#!/usr/bin/python
"""Pretty-print object-calculus expressions with various bits of shorthand.

In particular, elide 'self' names on method definitions that don't
need them; use () instead of {}.'()' in ObjectDerivation expressions
enclosed in a MethodCall(x, '()'); and elide 'arg1', 'arg2', etc., in
ObjectDerivation expressions that contain all of them.

It seemed to me that objcalc.py was getting too large, and I wanted to
pull out the pretty-printing functionality into another file,
particularly since it involves some substantial tree rewriting.
Unfortunately there doesn't seem to be a good way to do this in
Python.

What I'd really like to do is create an object 'objcalcpp' derived
from the 'objcalc' module, add an 'occurs_free' method to several of
the classes 'objcalcpp' inherits from 'objcalc', and override
'objcalcpp.SyntaxNode.stringify_methods',
'objcalcpp.MethodDefinition.__str__', and
'objcalcpp.ObjectDerivation.__str__', but while you can do that kind
of thing in the object-calculus or in Ohmu, you can't do it in Python
or any other current deployed OO language.

Approach #1: create two big functions full of 'isinstance' tests and
lots of tightly-coupled attribute access.  Works OK for 'occurs_free',
which only has to compute a boolean over the tree it gets (although I
did leave out a case here, fortunately one I tested for); but for
pretty_print, it needs to either modify or duplicate some of the code
presently found within the objects themselves, namely the __str__
methods of MethodDefinition, ObjectLiteral, ObjectDerivation, and
MethodCall, in order to redirect their recursive calls to itself.  Got
this working.

Drawbacks: not very OO (tight coupling, isinstance), error-prone (due
to case analysis), code duplication.

Approach #2: parallel class hierarchy.  A code smell in itself, but
could avoid reimplementing various __str__ methods.  Probably would
have to use multiple inheritance to redefine __getitem__ and __call__
in all the classes so that the normal syntactic sugar would work.

Approach #3: monkeypatching (known in other languages as 'reopening
classes from another module').  Replace methods of original classes
with 'better' versions.  This approach is, in some sense, the closest
approach to AOP.

Approach #4: rewrite in AspectJ.

Approach #5: rewrite in Haskell.

Approach #6: rewrite in the object-calculus.

"""
import objcalc

# approach #1

def template_function_to_copy_and_paste(var, ocexpr):
    """Not very object-oriented!"""
    if isinstance(ocexpr, objcalc.Constant):
        pass
    if isinstance(ocexpr, objcalc.Variable):
        pass
    elif isinstance(ocexpr, objcalc.ObjectLiteral):
        pass
    elif isinstance(ocexpr, objcalc.ObjectDerivation):
        pass
    elif isinstance(ocexpr, objcalc.MethodCall):
        pass
    raise CantHappen

def occurs_free_in_methods(var, methods):
    for method in methods:
        if var == method.selfname: continue
        if occurs_free(var, method.methodbody): return True
    return False

class CantHappen(Exception): pass

def occurs_free(var, ocexpr):
    if isinstance(ocexpr, objcalc.Constant):
        return False
    if isinstance(ocexpr, objcalc.Variable):
        return var == ocexpr
    elif isinstance(ocexpr, objcalc.ObjectLiteral):
        return occurs_free_in_methods(var, ocexpr.methods)
    elif isinstance(ocexpr, objcalc.ObjectDerivation):
        return (occurs_free(var, ocexpr.baseobj) or
                occurs_free_in_methods(var, ocexpr.methods))
    elif isinstance(ocexpr, objcalc.MethodCall):
        return occurs_free(var, ocexpr.objexpr)
    raise CantHappen

def prettyprint_method(method):
    name = objcalc.namequoter.quote(method.methodname)
    body = prettyprint(method.methodbody)
    if occurs_free(method.selfname, method.methodbody):
        return '%s.%s = %s' % (prettyprint(method.selfname), name, body)
    else:
        return '%s = %s' % (name, body)

def prettyprint_methods_of(objexpr):
    return objexpr.indent('\n'.join(map(prettyprint_method, objexpr.methods)))

def prettyprint_derivation(ocexpr, delims = '{}'):
    left, right = delims
    return '%s %s\n%s%s' % (prettyprint(ocexpr.baseobj),
                            left, prettyprint_methods_of(ocexpr), right)

def prettyprint(ocexpr):
    if (isinstance(ocexpr, objcalc.Constant) or 
        isinstance(ocexpr, objcalc.Variable)):
        return str(ocexpr)
    elif isinstance(ocexpr, objcalc.ObjectLiteral):
        return '{\n%s}' % prettyprint_methods_of(ocexpr)
    elif isinstance(ocexpr, objcalc.ObjectDerivation):
        return prettyprint_derivation(ocexpr)
    elif isinstance(ocexpr, objcalc.MethodCall):
        if ocexpr.methodname == '()' and isinstance(ocexpr.objexpr,
                                                    objcalc.ObjectDerivation):
            return prettyprint_derivation(ocexpr.objexpr, delims = '()')
        return '%s.%s' % (prettyprint(ocexpr.objexpr),
                          objcalc.namequoter.quote(ocexpr.methodname))
    raise CantHappen

# approach #2 (unfinished)

class Constant(objcalc.Constant): pass
class Variable(objcalc.Variable): pass
class ObjectLiteral(objcalc.ObjectLiteral): pass
class ObjectDerivation(objcalc.ObjectDerivation): pass
class MethodCall(objcalc.MethodCall): pass
class MethodDefinition(objcalc.MethodDefinition): pass

def ok(a, b): assert a == b, (a, b)
def test():
    self = objcalc.Variable('self', 'ocpp')
    empty = objcalc.ObjectLiteral([])
    _37 = objcalc.Constant(37)
    callme = empty(self['()'] >> _37)
    called = callme['()']
    ok(called.evaluate({}), 37)

    # omit 'self' where it doesn't occur free in the definition
    ok(occurs_free(self, self), True)
    ok(occurs_free(self, _37), False)
    # XXX need more tests for occurs_free
    ok(prettyprint(self), 'self')
    ok(prettyprint(_37), '37')
    ok(prettyprint(empty), "{\n}")
    ok(prettyprint(callme), "{\n} {\n" +
       "  '()' = 37\n" +
       "}")
    ok(prettyprint(callme['cat']), prettyprint(callme) + '.cat')
    ok(prettyprint(called), "{\n} (\n" +
       "  '()' = 37\n" +
       ")")

test()

def sample_code():
    """Returns an object-calculus expression with a little bit of sample code.

    This particular code is intended to be the ultimate parent of all
    objects.

    The code includes working definitions for 'true' and 'false', and
    a possibly-working implementation of Lisp-style lists with ':' for
    the cons operator, '++' for the list append operator, a 'filter'
    method on unary functions (using the booleans) to extract items
    from a list for which that function returns true, and a 'qsort()'
    method that uses 'filter', ':', and '++'.

    So far there's no support for infix operators, positional
    arguments, or not splitting short expressions onto many lines, in
    the pretty-printer, so code that should look like this:

        '++'.'()' = cons.head : (cons.tail ++ '++'.arg1)

    is rendered by the pretty-printer as follows instead:

        '++'.'()' = cons.head.':' (
          arg1 = cons.tail.'++' (
            arg1 = '++'.arg1
          )
        )

    """
    
    obj = Variable('object', 'ocpp.sample_code')
    self = Variable('self', 'ocpp.sample_code')
    my = Variable('my', 'ocpp.sample_code')
    function = Variable('function', 'ocpp.sample_code')
    cons = Variable('cons', 'ocpp.sample_code')
    pp = Variable('++', 'ocpp.sample_code')
    colon = Variable(':', 'ocpp.sample_code')
    x = Variable('x', 'ocpp.sample_code')
    filt = Variable('filter', 'ocpp.sample_code')
    booleans = Variable('booleans', 'ocpp.sample_code')

    return ObjectLiteral([
        self['booleans'] >> ObjectLiteral([
            booleans['true'] >> ObjectLiteral([
                self['negated'] >> booleans['false'],
                self['ifTrue'] >> ObjectLiteral([
                    x['then'] >> Constant(1),
                    x['else'] >> Constant(0),
                    x['()'] >> x['then'],
                ]),
                self['ifFalse'] >> self['negated']['ifTrue'],
            ]),
            booleans['false'] >> booleans['true'](
                self['negated'] >> booleans['true'],
                self['ifTrue'] >> booleans['true']['ifTrue'](
                    x['()'] >> x['else'],
                ),
            ),
        ]),
        # nil is the empty list, and the root of the list-node hierarchy
        obj['nil'] >> ObjectLiteral([
            self['isNil'] >> obj['booleans']['true'],
            self['++'] >> obj['unary_function'] (
                pp['arg1'] >> self,
                pp['()'] >> pp['arg1'],
            ),
            self['qsort'] >> ObjectLiteral([ my['()'] >> self ]),
        ]),

        # Comparison functions: you only have to define '<' and '==' in
        # your classes
        obj['>='] >> obj['<']['negated_function'],
        obj['<='] >> obj['>']['negated_function'],
        obj['!='] >> obj['==']['negated_function'],

        # There's no way to do 'flip' without 'getattr'.
        obj['>'] >> obj['unary_function'] (
            my['()'] >> my['arg1']['<'](self['arg1'] >> obj)['()'],
        ),

        # Unary functions have a method for filtering lists
        obj['unary_function'] >> ObjectLiteral([
            self['arg1'] >> Constant("Uninitialized unary function argument"),
            self['()'] >> Constant("Undefined function result"),
            function['filter'] >> obj['unary_function'] (
               filt['()'] >> filt['arg1']['isNil']['ifTrue'](
                    self['then'] >> filt['arg1'],
                    self['else'] >> function(x['arg1'] >>
                        filt['arg1']['head'])['()']['ifTrue'](
                            self['then'] >> filt['arg1'][':'](self['arg1'] >>
                                filt(x['arg1'] >> filt['arg1']['tail'])
                            )['()'],
                            self['else'] >> filt(
                                x['arg1'] >> filt['arg1']['tail'])['()'],
                        )['()'],
                   )['()'],
            ),
            function['negated_function'] >> obj['unary_function'] (
                my['()'] >> function(self['arg1'] >> my['arg1'])['()']['negated'],
            ),
        ]),

        # Define an infix operator that constructs list nodes
        self[':'] >> self['unary_function'] (
            colon['()'] >> self['nil'] (
                x['isNil'] >> self['booleans']['false'],
                x['head'] >> self,
                x['tail'] >> colon['arg1'],
                cons['++'] >> self['nil']['++'] (
                    pp['()'] >> cons['head'][':'](
                        my['arg1'] >> cons['tail']['++'](x['arg1'] >> pp['arg1'])['()'],
                    )['()'],
                ),
                self['qsort'] >> ObjectLiteral ([
                    my['low'] >> self['head']['<']['filter'](x['arg1'] >> self['tail'])['()'],
                    my['high'] >> self['head']['>=']['filter'](x['arg1'] >> self['tail'])['()'],
                    my['()'] >> my['low']['qsort']()['()']['++'] (
                        x['arg1'] >> self['head'][':'](
                            x['arg1'] >> my['high']['qsort']()['()']
                        )['()'],
                    )['()'],
                ]),
            ),
        ),
    ])

Thu, 13 Oct 2005

Suppose you wanted to plant a hidden camera for some long period of
time and capture photos of all that went past.  You'd like to never
again have to enter the place where it's hidden, and only visit it
rarely; you'd like it to be small; and you'd like it to last a long
time.  For example, the book "The Social Life of Small Urban Spaces"
was based on a few years of research in this vein using Super 8
cameras for time-lapse photography.  It appears to me that this
equipment should now be incredibly cheap.

USB "webcams" that capture 100-kB 640x480 JPEGs are on the order of
$10.  I think 4-port USB hubs (again, on the order of $10) contain all
the hardware necessary to act as USB host controllers; one could
imagine integrating the USB hub hardware with a small single-board
computer with SD/MMC and Bluetooth interfaces, for a total cost on the
order of $50 plus up to 4 cameras and their USB cables, and an MMC
card ($50-$110).

This device would presently be limited in smallness only by the size
of its power supply, USB ports, and multi-chip integration, so it
could be concealed in many places.  You could probably run it on 200mW
when running (for less than a second) and <1mW when idle.

You could drop by periodically with an inconspicuous Bluetooth device,
such as a cellphone or laptop, to download the pictures (say, 4
cameras * 100kB/shot/camera * 4 shots / minute * 60 minutes/hour * 24
hours/day = 2.3GB/day; but one shot per minute is only 144MB/day).
Anyone snooping over Bluetooth at the time could tell that a lot of
data was being sent over Bluetooth (1megabit/sec? not sure; but at
that speed you'd have to spend 2300 seconds in the vicinity.)

Alternatively, you could use a directional antenna from hundreds of
meters away (the "Bluesniper" folks managed to do 1km.)

An adaptive surveillance algorithm could shoot four times per minute
until the data card was full, followed by twice a minute (replacing
every other old shot, starting with the oldest) until the data card
was all full at twice a minute, then once per minute (thinning out old
shots to once a minute) until it was full again, etc.

Supposing that USB 12Mbps transfers were the limiting factor, you'd
need about 67ms of "on time" per shot, or (according to my 200mW
estimate above) 13.4 mJ.  My laptop's Li-ion battery supposedly holds
around 46Wh, or 165kJ (abridged info below):

$ cat /proc/acpi/battery/BAT1/state 
present rate:            1227 mA
remaining capacity:      2579 mAh
present voltage:         11300 mV
$ cat /proc/acpi/battery/BAT1/info
design capacity:         4500 mAh
last full capacity:      4067 mAh
design voltage:          10800 mV
model number:            XM2018P02   
battery type:            Li-ION  

11.3V * 4.067Ah = 46Wh.

On that basis, my laptop's battery could power 12 000 000 invasions of
privacy by this system --- saving that many camera shots to an MMC
card.  It might only be able to power 4 000 000 invasions of privacy
if it had to transmit them all over Bluetooth.  Still, that's nearly
six months in the four-shots-with-four-cameras-per-minute maxi
configuration described above, where you'd have to come download up
your photos at least once a day, and at one camera shooting once per
minute, it would last 8 years.

(I'm assuming that the webcams power up instantly.  This may be
unreasonable.)

Obviously you could do a similar job with audio surveillance, but
ironically, this may consume more storage and power; minimally
comprehensible speech is 10kbps under the best of conditions, so you'd
need at least 108MB/day, and probably several times that to get
anything useful.  You'd need some very-low-power constantly-on device
to buffer the audio so you wouldn't have to run the CPU all the time.

A similar system, but without the cameras or other transducers, could
serve as a maildrop or backup server (for data with high value per
byte, obviously).

We can anticipate that the power and monetary cost of data storage and
transmission will decrease considerably more before Moore's Law runs
out.

Mon, 10 Oct 2005

"Purple numbers" are a way to let people link to every paragraph of
every web page you publish, by adding unobtrusive short serial numbers
to every newly created paragraph.  Chris Dent and Eugene Eric Kim
invented them (I think), and they've been implemented in IRC-logging
software, Wikis, and blogging software.  In PurpleWiki, you can even
transclude pragraphs from arbitrary Wiki pages by purple number, so
that edits to those other Wiki pages are immediately reflected in the
page that transcludes them.

But purple numbers only solve the problem for sites you control.  I'd
often like to link to, or transclude, some paragraph in a page I don't
control.  How can I do this, without the site owner providing any
stable naming scheme for paragraphs?

A few years ago, some folks [find reference!] came up with a "lexical
signature" scheme for persistent naming of web pages.  They picked the
five nonunique words from the page with the highest TF/IDF scores,
using the entire Web as the corpus of reference for these scores.
Then they could use a full-text search engine such as AltaVista or
Google to find the document if its author irresponsibly allowed its
URL to become invalid.  (Unique words were excluded because they were
usually misspellings.)

Their scheme appended the lexical signature as query terms to the URL,
as in "?lexical_signature=purple+wiki+transclude+lexical+idf".  The
idea was to modify web browsers to automatically indirect through a
search engine for these links.

Clearly you could do the same thing for paragraphs with a fragment
identifier:
"#lexical_signature=few+corpus+reference+altavista+google".  But that
seems like an unnecessarily large fragment identifier for a
closed-world problem like distinguishing one paragraph from another in
a single web page.

So I suggest the following instead.  Make a bit-vector of zeroes; hash
the terms in the paragraph with the highest TF/IDF (calculating scores
relative to this document as the corpus) to index bits in your
bit-vector, setting each one to 1 in turn, until the bit-vector is
half full.  To dereference a link expressed as such a bit-vector,
calculate these same bit-vectors for each paragraph in the document,
and take the one with the smallest Hamming distance from the original
fragment ID.

You shouldn't need very many bits in your bit-vector.  I think 12 or
18 bits should be adequate.  You can encode these bit-vectors in
base64 in two or three characters, perhaps with some prepended special
character to prevent ordinary fragment IDs from being misinterpreted
in this way.  I suggest ".", since it's not a valid SGML NAME
character, but it's easy to type.

So now you'll have links that look like "foo.html#.2c" or
"walnuts.html#.X/.".  I propose calling these identifiers "queer
numbers", because they're a lot like "purple numbers", but they're
much more aggressive and "in your face", willing to go everywhere,
even where they're not welcome and not invited, all in the cause of
improving discourse and referenceability.

***

As a sample, here are the top TF/IDF words from the paragraphs in an
earlier version of this document:
people number adding serial implemented are by been edits other
-> are by them every software wiki pages even way that
control often don't site owner any sites how only without
-> control paragraphs naming do problem some scheme transclude link i
irresponsibly text years picked allowed find if use unique they
-> find they its words were reference scores altavista google because
idea browsers through query indirect their appended was modify automatically
-> lexical links engine terms lexical_signature search url signature wiki as
an single another clearly seems distinguishing closed world unnecessarily large
-> identifier google paragraphs same few do fragment altavista lexical_signature could
is relative expressed hash index calculate make id 1 smallest
-> each vector bit one document suggest same tf full terms
prevent since being two need special perhaps base64 character three
-> character it's bits suggest way vectors or some your not
all have queer go aggressive calling much discourse html lot
-> html they're not and because links even numbers like your

The lines preceded by '->' are the same as the previous lines, but
only contain nonunique words.  (Generally, unique words have high
TF/IDF!)  As you would expect from TF/IDF, they seem to have done a
pretty good job at excluding words included in other '->' lines.

This brings up an interesting point: you might get better Hamming
distances with bit-vectors less than half full, since the first few
words are likely to contain more-significant words than later words,
and you could actually use this feature to have a richer scoring
function than merely Hamming distance --- having a bit set that
corresponds to the paragraph's #1 word should perhaps count more than
the #2 word, etc.

Here's the same computation run on a more recent version of this
document (a version that happened to be 60% larger):
eric irc pragraphs eugene let unobtrusive reflected wikis short kim
-> every them software people number adding serial implemented been pages
stable i'd solve providing control often don't site owner any
-> control often don't site owner any sites how without naming
invalid excluded folks then persistent five ago entire using author
-> its were irresponsibly reference text years picked allowed find if
lexical idea browsers through query indirect their appended was modify
-> lexical idea browsers through query indirect their appended was modify
thing identifier single another clearly seems distinguishing closed world unnecessarily
-> identifier single another clearly seems distinguishing closed world unnecessarily large
take instead setting zeroes calculating dereference until turn following original
-> vector each is one relative expressed hash index calculate make
valid shouldn't encode adequate many type misinterpreted 18 be easy
-> it's prevent being two need special base64 character three should
foo cause referenceability propose welcome walnuts willing everywhere improving now
-> they're all queer go aggressive calling much discourse html lot

-> 
here top sample earlier paragraphs all don't being text years
-> paragraphs all don't being text years browsers through queer go
preceded high done seem previous would generally expect pretty good
-> ' lines nonunique contain have only other tf unique idf
distances set point merely corresponds likely scoring better interesting feature
-> than word hamming should more half distance up contain use
on run computation here's recent version more same document this
-> version more same document this of a the

>From a cursory inspection, it appears that the word lists remained
relatively stable through the 60% enlargement of the document ---
stable enough, anyway, that they would mostly retain linkage.
However, repeating the experiment again was not terribly promising,
although I suspect that this document (to which I have appended lists
of unique words from its previous versions) may be a particularly
tough nut to crack.

Sat, 08 Oct 2005

"""Very basic object-calculus evaluator.

No parser yet, though there's a pretty-printer (well, that's
stretching 'pretty'...).  Horrible hacks for abusing Python's parser
though.  See the bottom.

Thanks to Martin Abadi and Luca Cardelli for the formalism, David
Gibson for lending me their mindblowing book, Dave Long for
encouragement and mentoring, Mark Lentczner for giving me a clue about
the internals of prototype-based object systems (although clue
probably doesn't show here), Jonathan Edwards for building something
that actually worked and teaching me about it, DeLesley Hutchins for
friendship, ideas, and Ohmu.

TODO:
- write a pretty-printer (done in another module now):
  - x{...}.'()'  ->  x(...)
  - x = an_expr  ->  someunusedselfvar.x = an_expr
- write a GUI
- write a parser
- add more shortcuts to the pretty-printer, like those used in my recent
  kragen-hacks post with more software in this language:
  - {a, b, c}    ->  { arg1 = a, arg2 = b, arg3 = c }
  - infix operators: x + y -> x.'+'(y) (i.e. x.'+'{arg1=y}.'()'
  - x[y] -> x.'[]'(y)
- add primitive operations so it could possibly compute something in a
  useful way
- add a 'lint' to check for calls to nonexistent methods.  This isn't
  necessarily an error, since the methods may be added in subclasses,
  but it's always an error when it happens in my code.

"""

import types

def any(alist):
    for item in alist:
        if item: return True
    return False

class Record:
    """An object that has some fields."""
    # Reimplemented more or less from memory from MetaPy.Record.
    def __init__(self, *args):
        assert len(args) == len(self.fields)
        for name, value in zip(self.fields, args):
            setattr(self, name, value)
    def __repr__(self):
        comma = ', '
        return '%s.%s(%s)' % (self.__module__, self.__class__.__name__,
                              comma.join([repr(getattr(self, name))
                                          for name in self.fields]))

class OCObject(Record):
    """Object-calculus object."""
    fields = ['methods']
    def methodnames(self):
        return [name for name, method in self.methods]
    def call_method(self, methodname):
        """Evaluate method's return value."""
        for name, method in self.methods:
            if name == methodname: return method.call(self)
        raise "Method not found", methodname
    def derive(self, more_methods):
        """Create derivative object with me as prototype."""
        return OCObject(more_methods + self.methods)

class OCMethod(Record):
    """Object-calculus method."""
    fields = 'selfname', 'body', 'environment'
    def augment(self, name, val):
        rv = self.environment.copy()
        rv[name] = val
        return rv
    def call(self, obj):
        return self.body.evaluate(self.augment(self.selfname, obj))

class SyntaxNode(Record):
    """Abstract class: tree of symbols you might write out on paper."""
    # syntactic sugar below
    def __getitem__(self, other): return MethodCall(self, other)
    def __call__(self, *overrides):
        return ObjectDerivation(self, overrides)

class Quoter(Record):
    """Helper object to quote strings customizably."""
    fields = ['delimiter']
    def escape_char(self, char):
        if char in '\\' + self.delimiter: return '\\' + char
        else: return char
    def quote(self, value):
        return '%s%s%s' % (self.delimiter,
                           ''.join(map(self.escape_char, value)),
                           self.delimiter)

class Constant(SyntaxNode):
    """A syntactic string or number or something."""
    # Ultimately these should evaluate to OCObjects...
    fields = ['value']
    quoter = Quoter('"')
    def evaluate(self, environment): return self.value
    def __str__(self):
        if type(self.value) in types.StringTypes:
            return self.quoter.quote(self.value)
        else:
            return repr(self.value)

class NameQuoter:
    """Helper for quoting identifiers with '' if necessary."""
    quoter = Quoter("'")
    namechars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_$'
    middlenamechars = namechars + '0123456789'
    def has_funny_name(self, name):
        if len(name) == 0: return True
        if name[0] not in self.namechars: return True
        if any([char not in self.middlenamechars for char in name]):
            return True
        return False
    def quote(self, name):
        if self.has_funny_name(name): return self.quoter.quote(name)
        else: return name

namequoter = NameQuoter()

class Variable(SyntaxNode):
    fields = 'name', 'origin'
    def evaluate(self, environment): return environment[self]
    def __str__(self): return namequoter.quote(self.name)
    # Note Record's __repr__ isn't quite right here because it doesn't
    # preserve identity, and because we don't have an __eq__, identity
    # matters!


class ObjectExpression(SyntaxNode):
    """Abstract base class for objects that contain methods."""
    def stringify_methods(self):
        return self.indent('\n'.join(map(str, self.methods)))
    def indent(self, astr):
        lines = astr.split('\n')
        if not lines[-1]: lines = lines[:-1]
        return ''.join('  ' + line + '\n' for line in lines)

class ObjectLiteral(ObjectExpression):
    """An expression for an object that doesn't inherit from anything."""
    fields = ['methods']
    def evaluate(self, environment):
        return OCObject([m.evaluate(environment) for m in self.methods])
    def __str__(self): return '{\n%s}' % self.stringify_methods()

class ObjectDerivation(ObjectExpression):
    """An expression for an object derived from another."""
    fields = 'baseobj', 'methods'
    def evaluate(self, environment):
        return self.baseobj.evaluate(environment).derive(
            [method.evaluate(environment) for method in self.methods])
    def __str__(self):
        return '%s {\n%s}' % (str(self.baseobj), self.stringify_methods())


# This one is a little funny... it's not lexically an expression, it
# "evaluates" to something that isn't a value, its callers
# (ObjectLiteral and ObjectDerivation) know they're evaluating method
# definitions, not expressions, and it doesn't benefit from any of
# SyntaxNode's methods.  So it's not a SyntaxNode.
class MethodDefinition(Record):
    """Part of an object definition defining a single method."""
    fields = 'selfname', 'methodname', 'methodbody'
    def evaluate(self, environment):
        return (self.methodname,
                OCMethod(self.selfname, self.methodbody, environment))
    def __str__(self):
        return '%s.%s = %s' % (self.selfname,
                               namequoter.quote(self.methodname),
                               self.methodbody)

class MethodCall(SyntaxNode):
    """An expression such as {}.bar, calling a method."""
    fields = 'objexpr', 'methodname'
    def evaluate(self, environment):
        return self.objexpr.evaluate(environment).call_method(self.methodname)
    def __str__(self):
        return "%s.%s" % (self.objexpr, namequoter.quote(self.methodname))
    def __rshift__(self, other):
        """Syntactic sugar for constructing a method definition."""
        return MethodDefinition(self.objexpr, self.methodname, other)


### Unit tests below

def ok(a, b): assert a == b, (a, b)
def ensure_exception(thunk):
    try: thunk()
    except: pass
    else: assert 0, "ick"
    
def test():
    # Evaluating object-literal expressions
    emptyobjlit = ObjectLiteral([])
    emptyobj = emptyobjlit.evaluate({})
    ok(emptyobj.methodnames(), [])

    ensure_exception(lambda: emptyobj.call_method('bob'))

    # Defining some methods
    bobself = Variable('self', 'in bob')
    maryself = Variable('self', 'in mary')
    freevar = Variable('freevar', 'from outside')
    fvself = Variable('self', 'in fv')
    assert bobself != maryself
    nonemptyobjlit = ObjectLiteral(
        [MethodDefinition(bobself, 'bob', bobself),
         MethodDefinition(maryself, 'mary', Constant(3)),
         MethodDefinition(fvself, 'fv', freevar)])

    nonemptyobj = nonemptyobjlit.evaluate({freevar: 1973})
    ok(nonemptyobj.methodnames(), ['bob', 'mary', 'fv'])
    ok(nonemptyobj.call_method('bob'), nonemptyobj)
    ok(nonemptyobj.call_method('mary'), 3)
    ok(nonemptyobj.call_method('fv'), 1973)
    ensure_exception(lambda: nonemptyobj.call_method('nonexistent'))

    # Deriving an object; scoping.
    nonemptyobjvar = Variable('nonemptyobj', 'from outside')

    derivedobjexpr = ObjectDerivation(nonemptyobjvar, [
        MethodDefinition(fvself, 'fv2', freevar),
        MethodDefinition(maryself, 'mary', Constant(15))
    ])

    derivedobj = derivedobjexpr.evaluate(
        {nonemptyobjvar: nonemptyobj, freevar: 1976})
    ok(derivedobj.call_method('mary'), 15)
    # see, two values for the same variable from different contours.
    ok(derivedobj.call_method('fv2'), 1976)
    ok(derivedobj.call_method('fv'), 1973)

    # Evaluating method-call expressions.
    methodcallexpr = MethodCall(derivedobjexpr, 'mary')
    ok(methodcallexpr.evaluate({nonemptyobjvar: nonemptyobj}), 15)

    # OK, now for booleans.  From kragen-tol post "How to take
    # parameters in the object calculus?",
    #http://lists.canonical.org/pipermail/kragen-tol/2005-September/000792.html
    # 2005-09-28.
    booleans = Variable('booleans', 'outermost')
    self = Variable('self', 'innermore')
    x = Variable('x', 'innermost')
    truenegated = MethodDefinition(self, 'negated',
                                   MethodCall(booleans, 'false'))
    trueiftrue = MethodDefinition(self, 'ifTrue', ObjectLiteral([
        MethodDefinition(x, 'then', Constant(1)),
        MethodDefinition(x, 'else', Constant(0)),
        MethodDefinition(x, 'val', MethodCall(x, 'then')),
        ]))
    trueiffalse = MethodDefinition(self, 'ifFalse',
                                   MethodCall(MethodCall(self, 'negated'),
                                              'ifTrue'))
    truedef = MethodDefinition(booleans, 'true', ObjectLiteral([
        truenegated,
        trueiftrue,
        trueiffalse,
        ]))

    falsenegated = MethodDefinition(self, 'negated',
                                    MethodCall(booleans, 'true'))
    falseiftrue = MethodDefinition(self, 'ifTrue', ObjectDerivation(
        # Note that this expression erroneously read booleans.ifTrue
        # in the original; it should have read booleans.true.ifTrue
        MethodCall(MethodCall(booleans, 'true'), 'ifTrue'), [
            MethodDefinition(x, 'val', MethodCall(x, 'else')),
            ]
        ))
                                   
    falsedef = MethodDefinition(booleans, 'false', ObjectDerivation(
        MethodCall(booleans, 'true'), [falsenegated, falseiftrue]
        ))

    booleansdef = ObjectLiteral([truedef, falsedef])
    booleans_obj = booleansdef.evaluate({})
    booleans_true = booleans_obj.call_method('true')
    booleans_false = booleans_obj.call_method('false')

    hungry = Variable('hungry', 'original post')
    hungryexpr = MethodCall(ObjectDerivation(MethodCall(hungry, 'ifTrue'), [
        MethodDefinition(x, 'then', Constant("eat")),
        MethodDefinition(x, 'else', Constant("don't eat")),
        ]), 'val')

    ok(hungryexpr.evaluate({hungry: booleans_true}), "eat")
    ok(hungryexpr.evaluate({hungry: booleans_false}), "don't eat")

    # Python syntactic sugar and debug display
    desired_hungry_string = ("hungry.ifTrue {\n" +
                             "  x.then = \"eat\"\n" +
                             "  x.else = \"don't eat\"\n" +
                             "}.val")
    ok(str(hungryexpr), desired_hungry_string)

    # The following is pretty sour as syntactic sugar goes, but it's
    # terser than the earlier expression, and IMHO in a clearer order;
    # it's just that all the operators mean something almost, but not
    # quite, entirely unlike their usual meaning.
    
    other_hungry_expr = hungry['ifTrue'](
        x['then'] >> Constant("eat"),
        x['else'] >> Constant("don't eat")
    )['val']

    ok(str(other_hungry_expr), str(hungryexpr))

    gross_booleans_def = ObjectLiteral([
        booleans['true'] >> ObjectLiteral([
            self['negated'] >> booleans['false'],
            self['ifTrue'] >> ObjectLiteral([
                x['then'] >> Constant(1),
                x['else'] >> Constant(0),
                x['val'] >> x['then'],
            ]),
            self['ifFalse'] >> self['negated']['ifTrue'],
        ]),
        booleans['false'] >> booleans['true'](
            self['negated'] >> booleans['true'],
            self['ifTrue'] >> booleans['true']['ifTrue'](
                x['val'] >> x['else'],
            ),
        ),
    ])

    ok(str(gross_booleans_def), str(booleansdef))

    # Nested indentation
    ok(str(booleans(self['true'] >> booleans['true'](
        self['and'] >> self['ifTrue']))),
    "booleans {\n" +
    "  self.true = booleans.true {\n" +
    "    self.and = self.ifTrue\n" +
    "  }\n}")

    # Strings should always be displayed with double quotes.
    ok(str(Constant("boo")), '"boo"')
    ok(str(Constant("doesn't")), '"doesn\'t"')
    ok(str(Constant('"yes"')), r'"\"yes\""')
    ok(str(Constant(r'\"yes\"')), r'"\\\"yes\\\""')

    # Funny names should be displayed with single quotes.
    ok(str(Variable("haha", 'py')), "haha")
    ok(str(Variable("ha ha", 'py')), "'ha ha'")
    ok(str(Variable("don't", 'py')), r"'don\'t'")
    ok(str(Variable("\\", 'py')), "'\\\\'")
    ok(str(Variable('a0', 'py')), "a0")
    # Really evil things to name your variables:
    ok(str(Variable('', 'py')), "''")
    ok(str(Variable('0', 'py')), "'0'")

    # Funny method names too.
    ok(str(hungry['man']), 'hungry.man')
    ok(str(hungry['m n']), "hungry.'m n'")

    # Also in method definitions.
    ok(str(hungry['man'] >> Constant("pancake")),
        'hungry.man = "pancake"')
    ok(str(hungry['m n'] >> Constant("pancake")),
        'hungry.\'m n\' = "pancake"')
    haha = Variable("ha ha", 'py')
    ok(str(haha['ha ha'] >> haha['ha ha']),
        "'ha ha'.'ha ha' = 'ha ha'.'ha ha'")

test()

Thu, 06 Oct 2005

So I consulted briefly on a project that was using a semistructured
data store for everything.  Everything was a graph, and they were
having a hard time getting it to all run at a reasonable speed,
because every time they touched a node, they issued a SQL query to
pull all the edges related to it out of the database.  Since typical
pages showed a few important properties from a large number of objects
(think: first and last names of all of your friends), every page
actually copied a substantial fraction of the data out of the
database.

On the other hand, it seemed that if you composed a SQL query (some
horrendous multi-way self-join) that fetched only the needed data from
the database, MySQL's query optimizer would do a good enough job
optimizing the query that it would fetch the needed data in a small
fraction of a second.

I'd experienced this before on a database-backed web site project, and
the cause was that the fetch_friends() method (or get_all_rogue_aps()
or whatever) didn't have enough knowledge of the context it was being
called in to know what other information its caller would want out of
the database; but it had to fetch the results immediately so its
caller could use them.  Using MySQL merely as a triple store, as
Class::RDF does, just made the problem ten times worse.

On my way to the grocery store, I was thinking about this, and about
Erlang's nested "IO lists" of characters for constructing output, and
about HTML templates, and my own implementation of part of the
relational algebra in Python with lazy compilation to SQL ("prototype
SchemeQL/Roe-like thing in Python",
http://lists.canonical.org/pipermail/kragen-hacks/2004-April/000394.html).

And it occurred to me that if you filled in your HTML template with
lazy queries ("for each 'friend' edge, spit out this chunk of HTML
which interpolates the 'firstname' and 'lastname' values from its
context") then you could actually wait until you'd constructed the
whole page and were running the template engine before you hit the
database.  You could run all kinds of imperative code to create these
lazy query chunks and paste them together, but without actually
hitting the database.

That means that you have all the information required to compile
everything down into an efficient SQL query (or two or three).

You can't do this, of course, for information that you have to fetch
from the database in order to direct the flow of some imperative code,
but you can usually avoid doing that an arbitrary number of times per
page.

If you merely want to display each person's last name or not,
according to whether they're your friend, your imperative code can
create a conditional operator node from the "friend" status, the last
name, and the empty string (or whatever you use instead of the last
name).  This will compile into a query that fetches both the friend
status and the last name for each person; the database will fetch all
the last names, even the ones you don't want to display, and generally
this will be pretty efficient.

But for decisions like whether to do a particular update to the
database or to take some other action in the world, you must do
non-lazy database queries from your imperative code, which means they
can't be batched and optimized into a single query.  This usually
isn't the pain point, though.

This approach doesn't break even if your "templating" is doing
arbitrarily complex stuff internally, in whatever imperative or
functional style you like, as long as it doesn't need more data from
the database.

This might be a fun way to build a SnikiSniki-like system.

Mon, 03 Oct 2005

The Oxford English Dictionary, generously supported by the Oxford
University Press, is one of the earliest instances of what are now
called "pro-am" or "commons-based peer production" projects.  From
1857 to 1928, thousands of readers collected examples of uses of words
their dictionaries didn't define; they mailed these examples on slips
of paper to a small number of editors, who undertook to collate them
into a dictionary.  From 1884 to 1928, these editors published their
work in fascicles, mostly in alphabetical order.
<http://en.wikipedia.org/wiki/Oxford_English_Dictionary --- Wikipedia
article "Oxford English Dictionary">

In recent years, with the advent of public access to the internet, it
has become apparent that commons-based peer production works best when
no single party can restrict the uses of the end product; more people
can use it, it can be put to more uses, poor coordinators can be
replaced, and contributors have assurance that they will be able to
use their own work.  <http://perens.com/Articles/Economic.html ---
"The Emerging Economic Paradigm of Open Source", by Bruce Perens;
http://www.benkler.org/CoasesPenguin.html --- "Coase's Penguin, or
Linux and the Nature of the Firm, by Yochai Benkler>

This form of commons-based peer production of information, in which
the end product can be studied, copied, modified, and used freely, is
often called "Open Source
development". <http://opensource.org/docs/definition.php --- "The Open
Source Definition, Version 1.9", promulgated by the Open Source
Initiative; http:///www.catb.org/~esr/writings/cathedral-bazaar/ ---
"The Cathedral and the Bazaar", by Eric S. Raymond> It got this name
because it started with software whose source code was freely
available for all these purposes, also known as "free software"
<http://www.fsf.org/ --- the Free Software Foundation>.

Tim Bray, the world-famous hacker who co-invented XML, explains how
the OED is not currently open source:

    Well, literally thousands of people around the world diligently
    read books looking for usages of words and writing them on slips
    and sending them to Oxford. Many, many millions of these things
    are in filing cabinets in the basement of Oxford. Then Oxford, of
    course, turned them around to do a commercial product. It's not
    as though the underlying citation store or the dictionary itself
    are open for free access to anybody except for Oxford.

    So I don't think it's really open source in some of the essential
    characteristics. It is certainly community-based and
    community-driven. And it clearly became the case that some of the
    unpaid volunteers became thought leaders in terms of how you go
    about finding things.

<http://www.acmqueue.com/modules.php?name=Content&pa=printer_friendly&pid=282&page=1
--- "A Conversation with Tim Bray", ACM Queue, Vol. 3, No. 1, February 2005>

If the Oxford English Dictionary were Open Source, we could expect the
following improvements:
- Definitions would be available in many contexts; for example, within
  a word processor, at the command line, in a web browser.
- OED definitions and etymologies would be available to many more
  people, so many more people would think about how they needed
  improvement.
- When a person noticed a bad definition or an opportunity for
  improvement, they could immediately fix it in their local copy of
  the dictionary, and later, share their improvement with others who
  were interested.  This is particularly important because the OED is
  quite out of date, especially the parts in the public domain.
- Definitions and etymologies could be augmented with unlimited
  examples of use, drawn from the English literary canon (via Project
  Gutenberg) on demand.
- People could develop innovative software for looking up definitions;
  for example, it could disambiguate misspellings according to context
  of use, and preferentially display word senses that might apply in
  context (noun versus verb, for example, or by the publication year
  or country of origin of the work containing the unknown word, if
  that's available.)
- Web sites such as http://www.snopes.com/ could link to authoritative
  definitions and etymologies of words, and even quote them in full
  without fear of copyright infringement.
- Its English-language definitions could be translated into other
  languages (perhaps incrementally, as people requested them) to
  supplement existing inter-lingual dictionaries, or perhaps even
  create new ones.

I have been investigating what would be required to make the OED Open
Source.  Much of the first edition is out of copyright; in general,
anything published before 1923 is in the public domain in the US and
in Berne Convention countries.  Someone could take this
out-of-copyright text and create a public-domain or
open-source-licensed version thereof.

The fascicle 'W-Wash' was published in 1921
<http://www.colbycosh.com/old/december02.html ---
http://oed.com/pdfs/oed-news-2002-06.pdf --- article "J.R.R. Tolkien
and the OED", Oxford English Dictionary News, Series 2, Number 21,
June 2002, by Peter Gilliver, pp.1-3>; this suggests that nearly the
entire dictionary is out of copyright, in the form in the fascicles.
However, I don't know how to get hold of them, and the Wikipedia
article cited above mentions that the first one sold only 4000 copies
--- so there may be fascicles of which no copies survive.  For
example, none are listed on http://www.abebooks.com/ as far as I can
tell; I searched on "new english dictionary historical" and "fascicle
dictionary english" with little luck.  Searching for "new english
dictionary historical principles" in the title, however, I did find
several volumes supposedly published before 1921, at very reasonable
prices --- US$30-US$130.  (I found advertisements for volumes D-E,
H-K, L-N, Q-R, S-SH, V-Z, X-ZYXT, all claiming to be from before 1923,
comprising nearly half of the first edition OED.)

If someone were to take the original pages, of which I guess there are
around ten thousand, photograph each one with a cheap five-megapixel
digital camera, and compress the result, each page image would
probably be around a megabyte and take about ten seconds to produce;
the entire set would require only about ten gigabytes and about 30
hours of labor to produce.  It could then be distributed by BitTorrent
and DVD-R's.  

The Internet Archive's Texts Collection's Million Books Project
<http://www.archive.org/details/texts> consists of books scanned in
more or less this manner, although they are using expensive
sixteen-megapixel cameras because they have a wider range of uses in
mind.  The sample book I'm looking at is one bit deep and is
compressed by about a factor of 24 --- 558 pages, 8.5 megapixels each,
which would be 600 megabytes uncompressed, but is 24.4 megabytes
compressed with DjVu --- still around a megabyte per page.
<http://www.archive.org/details/ChurchDictionary --- Walter Parquhar
Hook's 1842 Church Dictionary>.

By itself, such a collection of images would be slightly less useful
than the original books, although much more easily reproduced.  You
would still have to page through the collection of numbered pages one
by one to find the page containing the word you wanted, but the images
would be displayed on a conventional small low-resolution computer
monitor rather than a large, high-resolution book page.  Consequently
it would be somewhat slower to access than the original books.

However, once the page images were available in the public domain, it
would be possible at any later time, and in any small increment, to
annotate them with OCR results or hand-written transcriptions, which
could also be corrected by people consulting it.

(My experiments with free-software OCR have not been terribly
encouraging so far.)

Sat, 01 Oct 2005

# Some thoughts on prototype programming, transcribed from a notebook

# Notation:
# x.a is the value of the field "a" from the object "x".

# { a = 1, b = 2 } is an object with two fields; equivalent to
# {
#     a = 1
#     b = 2
# }
# and also { self.a = 1, this.b = 2 }.  The field definitions are
# really method bodies; the name before the "." is the name they can
# use to refer to the object they're being called on.  You could write
# another object with the same two fields and values as 
# { a = 1, banana.b = banana.a + 1 }.

# You can "derive" from an object by writing the object followed by 
# { } and some more field definitions.  The resulting object will have
# the specified fields overridden.  So { a = 1, x.b = x.a + 1 } { a = 3 }
# is another way to write { a = 3, x.b = x.a + 1 }.  (In general,
# because of lexical scoping, you can't always evaluate by textual
# substitution in this way.)

# Objects not explicitly derived from anything else are derived from
# "object".

# You can surround something non-wordlike with '' to make it a word
# you can use as a field name or object name.  This way you can define
# fields with funny names.  For example, { '{}#@%@!' = 3 } has one
# field, named "{}#@%@!".

# "foo(stuff)" is just a shorthand for "foo{stuff}.'()'".  "{a, b, c}"
# is just a shorthand for "{ arg1 = a, arg2 = b, arg3 = c }".

# You can write strings surrounded by "".  They derive from an object
# known as "string", which derives from "object".

# In various places here I use infix operators, such as "+", "%",
# "==", "*", and "<".  The intent is that "a + b" is shorthand for
# "a.'+'(b)", i.e. "a.'+'{arg1=b}.'()'", but nothing here yet tries to
# override these methods.  Most of the operators would be implemented
# as primitives and are self-explanatory, but the "%" on strings is
# not.  It's inspired by the Python string "%" operator, but works
# differently; it replaces curly-bracketed names in its left operand
# with the named fields in the right operand.

# "a[b]" is intended to be shorthand for "a.'[]'(b)".  Indices are
# zero-based; I've borrowed Python's slice syntax; foo[bar:baz] is a
# subsequence of foo, containing foo[bar], foo[bar+1], foo[bar+2],
# etc., up to foo[baz-1].  foo[bar:] is the same as
# foo[bar:foo.length], and foo[:baz] is the same as foo[0:baz].  I
# don't know how this syntax should work.



# In this file, the names given to self parameters in top-level
# definitions denote the objects on which those names are defined.

# recursive string.split

string.split = {
    delim = arg1 = ","    # syntax for making "delim" an alias for arg 1
    split.pos = string.index(split.delim)
    split.before = string[:split.pos]
    split.after = string[split.pos + split.delim.length:]
    split.recursion = split.after.split(split.delim)
    split.'()' = split.pos.ifNull(then=[string], else=[split.before] + recursion)
}

# ifNull, ifTrue don't need to be primitives
true.ifTrue = {
    then = true
    else = false
    self.'()' = self.then
}

# false's ifTrue is just like true's, except that it overrides the
# return value
false.ifTrue = true.ifTrue { self.'()' = self.else }

# prototype-based functions give us currying, without the mental overhead
object.ifNull = (object is null).ifTrue

# Here's something like C's "for" loop.
# for (state; !done; ...) { ... }
loop = {
    state = { i = 0 }
    loop.done = loop.state.i == 10
    loop.i = loop.state.i + 1
    loop.'()' = loop.done.ifTrue(then=loop.state, else=loop(state=loop))
}
# Perhaps it should say "then=loop.state.result" instead of
# "then=loop.state".  That could flatten the following routine into
# just a derivation from "loop".  I'm not sure what object "loop"
# should be defined on.

string.split = {
    delim = arg1 = ","
    s.loop = loop(
        state = { rv = [], str = string }
        loop.pos = loop.state.str.index(s.delim)
        loop.done = loop.pos is null
        loop.pre = loop.state.str[:loop.pos]
        loop.rv = loop.state.rv + [loop.pre]
        loop.str = loop.state.str[loop.pos + s.delim.length:]
    )
    s.'()' = s.loop.rv + [s.loop.str]
}
# I think there's a bug there when pos is initially null

# often we want some object, or if it's null, some substitute.  People
# often do this in Perl with 'or' or '||'.
object.nullis = {
    default = arg1 = null
    nullis.'()' = object.ifNull(then=nullis.default, else=object)
}

# Here's a URL manipulation object.  It contains five derivations from
# non-class-like objects, in essentially the places where a
# non-functional OO language would mutate itself.
url = {
    scheme = "http"
    host = "www.example.com"
    port = 80
    path = "/"
    query = null
    fragment = null
    username = null
    password = null

    url.hostport = url.port.ifNull(then=url.host, else="{host}:{port}" % url)
    url.auth = url.password.ifNull(
        then = url.username
        else = "{username}:{password}" % url{username=url.username.nullis("")}
    )

    url.asString = "{scheme}://{netloc}{path}{qpart}{fpart}" % url{
        netloc = url.auth.ifNull(
            then = url.hostport
            else = "{auth}@{hostport}" % url
        )
        qpart = url.query.ifNull(then="", else="?" + url.query)
        fpart = url.fragment.ifNull(then="", else="#" + url.fragment)
    }

    url.omitDefPort = url{
        port = (url.port == 80).ifTrue(then=null, else=url.port)
    }
    url.explicitPort = url{port = url.port.nullis(80)}
    url.withHostport = {
        arg1 = "example.com:8000"
        my.hostport = my.arg1.split(":")
        my.host = my.hostport[0]
        my.port = (my.hostport.length < 2).ifTrue(
            then = null
            else = my.hostport[1].asInt
        )
        my.'()' = url{host=my.host, port=my.port}
    }
}

# this API allows
url.withHostport("lula:1000").port   # --> 1000
url{host="lula", port=80}.omitDefPort.hostport   # --> "lula"

# When I wrote this, I thought that a non-prototype-based interface
# probably would need a separate "hostport" class for this kind of
# manipulation, but I don't think that's true.

# first thing I scribbled down in this notation on 2005-07-21
{
    fact = {
        arg = 1
        f.val = arg.equals(0).ifTrue{
            then = 1
            else = f.arg.times{
                f(arg = f.arg.minus(1)
            }
        }
}

# corrected and modernized, this is
factorial = {
    arg1 = 1
    f.'()' = (arg1 == 0).ifTrue(
        then = 1
        else = f.arg1 * f(f.arg1 - 1)
    )
}


# ok, so here we go again with string.split

object.loop = {
    prev = { i = 0 }
    loop.i = loop.prev.i + 1
    loop.done = loop.i == 10
    loop.'()' = loop.done.ifTrue(then=loop.prev.result, else=loop(prev=loop))
}

string.split = string.loop {
    delim = arg1 = ","
    self.prev = self { rv = [], left = string }
    loop.pos = loop.prev.left.index(loop.delim)
    loop.done = loop.pos is null
    loop.rv = loop.prev.rv + [loop.prev.left[:loop.pos]]
    loop.left = loop.state.str[loop.pos + loop.delim.length:]
    loop.result = loop.rv + [loop.left]
}

# compare Python more-or-less equivalent:
# def splitstring(string, delim=","):
#     rv = []
#     left = string
#     while 1:
#         pos = left.index(delim)
#         if pos == -1: return rv + [left]
#         rv.append(left[:pos])
#         left = left[pos + len(delim):]
# ... not much of an improvement.  

# In "contextual" form, where most of the self-names are elided:

split = loop {
    delim = arg1 = ","
    self.prev = self { rv = [], left = string }
    pos = prev.left.index(delim)
    done = pos is null
    rv = prev.rv + [prev.left[:pos]]
    left = state.str[pos + delim.length:]
    result = rv + [left]
}

# string.index finds one string in another; the naive algorithm:

string.index = string.loop {
    needle = arg1 = "foo"
    self.prev = self { i = 0 }
    outer.found = string.loop (
        prev = self { i = 0, equal = true }
        in.equal = string[in.i + outer.i] == outer.needle[in.i]
        inner.done = (!inner.equal) or inner.i >= outer.needle.length
    )
    outer.done = outer.found or outer.i > string.length - outer.needle.length
    outer.result = outer.found.ifTrue(then=outer.i, else=null)
}

# typedef struct { char *s; int length; } string;
# int index(string *haystack, string *needle) {
#     int ii, jj;
#     for (ii = 0; ii <= haystack->length - needle->length; ii++) {
#         for (jj = 0; jj < needle.length; jj++) {
#             if (haystack->s[ii + jj] != needle->s[jj]) goto nextloop;
#         }
#         return ii;
#     nextloop:
#     }
#     return -1;
# }