Mon, 30 Apr 2007

Hi Kragen,

* Kragen Javier Sitaker <kragen at pobox.com> [2007-04-30 09:40]:
> I suspect that explanation #5 is the correct explanation of the
> term's origin, and it's prefigured by the "Unices" and
> "Twenices" examples from the Jargon File [4], but I intend to
> avoid the use of "redices" except in clearly playful contexts
> because of the possibility of interpretations #1, #2, #4, and
> especially #6, which will make it more difficult for the
> listener to discover the correct derivation from "reducible
> expression".  Maybe this is just me taking myself way too
> seriously.

my first reaction would be to suggest “redexen”, along the lines
of “regexen”. According to your exposition this is a lot “less
incorrect”, and I think the “-en” ending is far more obviously an
attempt at willful idiosyncracy than is the “-ices” form of Latin
words.

(I suppose the difference is in the perception of their origins,
lending the high-brow air of Latin to one vs the brutish cut of
Germanic to the other. Observe the kinds of words for which the
dictionary lists one or the other form as the correct plural to
see where that comes from.)

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

I recently corrected someone who used the term "redices" to mean
"reducible expressions" (in the context of the lambda-calculus and
similar formal systems):

    "redices" is not the plural of "redex".  If there were a Latin word
    "redix", it might pluralize as "redices", but there isn't, and
    "redix" is different from "redex" anyway.

He pointed out:

    I think "redices" is fairly common use, as a googling
    confirms. . . . What makes you think "redices" isn't the plural of
    "redex"?  How are you with "indices"?

"Index"
-------

I had to admit that he was right about "index" not being "indix".  I
don't speak Latin myself, but a friend of mine who does explained to
me that "index" and "indices" are the nominative singular and plural
of "index", a regular third-declension Latin noun.  Charlton Lewis's
Latin dictionary has the following entry for it [1]:

    index dicis, m and f [in+DIC-] , one who points out, a discloser,
    discoverer, informer, witness: falsus, S.: haec omnia indices
    detulerunt.-- An informer, betrayer, spy: vallatus indicibus:
    saeptus armatis indicibus: silex, qui nunc dicitur index,
    traitor's stone, O.--An index, sign, mark, indication, proof:
    complexus, benevolentiae indices: vox stultitiae: auctoris
    anulus, O.: Ianum indicem pacis bellique fecit, L.--A title,
    superscription, inscription: deceptus indicibus librorum: tabula
    in aedem cum indice hoc posita est, L.--A forefinger, index
    finger: pollex, non index: indice monstrare digito, H.

I don't know the etymology of the term "redex" for certain, but it
means "reducible expression", and is therefore an originally English
word, not a Latin loanword.  It doesn't appear in the Oxford English
Dictionary Online.

English Pluralization
---------------------

English is somewhat unusual in that it often imports irregular
pluralizations of loanwords along with the original loanwords --- thus
we have mujahedin, Taliban, tableaux, criteria, cherubim, axes, and
bacteria, rather than *mujahids, *Talibs, *tableaus, *criterions,
cherubs, *axises, and *bacteriums. [2] This adds to the confusion of
irregular plurals already natively present in English, things like
"oxen", which was in Old English, coming from Proto-Germanic and
ultimately from Proto-Indo-European. [3] 

As illustrated above, there are many cases in which the regular plural
form is considered incorrect by almost everyone, but there are other
cases where use of the irregular "classical" form is a way to show off
the speaker's erudition; I think "index" and "cherub" are such
examples.  Both "indexes" and "cherubs" are legitimate English, but
"indices" and "cherubim" are ways to demonstrate your erudition, and
perhaps your knowledge of Latin and Hebrew.  Damian Conway's article
"An Algorithmic Approach to English Pluralization" [7] lists several
more examples.

Showing off one's knowledge may be thought to demonstrate an arrogant
attitude of superiority, if those you're talking to don't share that
same knowledge, or to demonstrate that you belong in the group, if
they do.  However, in either case, it's worse if the folks you're
talking to know that the knowledge you're showing off is wrong.

The Jargon File mentions deliberately irregular pluralizations in
hacker jargon [4]:

    Further, note the prevalence of certain kinds of nonstandard
    plural forms. Some of these go back quite a ways; the TMRC
    Dictionary [from the 1950s?] includes an entry which implies that
    the plural of 'mouse' is meeces, and notes that the defined plural
    of 'caboose' is 'cabeese'. This latter has apparently been
    standard (or at least a standard joke) among railfans (railroad
    enthusiasts) for many years.

    On a similarly Anglo-Saxon note, almost anything ending in 'x' may
    form plurals in '-xen' (see VAXen and boxen in the main text)
    [following "oxen"]. Even words ending in phonetic /k/ alone are
    sometimes treated this way; e.g., 'soxen' for a bunch of
    socks. Other funny plurals are 'frobbotzim' for the plural of
    'frobbozz' (see frobnitz) and 'Unices' and 'Twenices' (rather than
    'Unixes' and 'Twenexes'; see Unix, TWENEX in main text). But note
    that 'Unixen' and 'Twenexen' are never used; it has been suggested
    that this is because '-ix' and '-ex' are Latin singular endings
    that attract a Latinate plural. Finally, it has been suggested to
    general approval that the plural of 'mongoose' ought to be
    'polygoose'.

    The pattern here, as with other hackish grammatical quirks, is
    generalization of an inflectional rule that in English is either
    an import or a fossil (such as the Hebrew plural ending '-im', or
    the Anglo-Saxon plural suffix '-en') to cases where it isn't
    normally considered to apply.

    This is not 'poor grammar', as hackers are generally quite well
    aware of what they are doing when they distort the language. It is
    grammatical creativity, a form of playfulness. It is done not to
    impress but to amuse, and never at the expense of clarity.

One might also speculate that hackers do this to poke fun at those who
think that knowing Latin declensions makes them smart.

Other, non-hacker occurrences of the same playful misapplied Latin
pluralization have been reported, such as "grimi" for "grimaces" or
"waitri" for "waitresses". [6]

Back to "Redices"
-----------------

So I can think of six likely interpretations a listener might arrive
at when they hear "redices" used as the plural of "redex":

1. The speaker is trying to demonstrate that they're better-educated
   than I am, but they are failing, because they don't know enough
   Latin to know that "redex" isn't a Latin word.  Therefore, they are
   not only arrogant but unskilled and unaware of it [5].

2. The speaker is trying to demonstrate that they are as well educated
   as I am, but they are failing, because they don't know enough Latin
   to know that "redex" isn't a Latin word.  Therefore, they think
   they are not worthy of associating with me, because they think they
   would need to have a classical education in order to be so, but
   they clearly don't have it.  Furthermore, they are unskilled and
   unaware of it.

3. The speaker uttered an incomprehensible word.  They must have a
   bigger vocabulary than I do; maybe they are smart and I should
   listen to them more carefully.

4. The speaker uttered an incomprehensible word.  They must be talking
   nonsense.

5. The speaker is playfully forming a nonstandard plural.

6. This word "redex" I haven't heard before must be a Latin word, and
   "redices" is either its only correct plural or a correct show-off
   academic plural.

I suspect that explanation #5 is the correct explanation of the term's
origin, and it's prefigured by the "Unices" and "Twenices" examples
from the Jargon File [4], but I intend to avoid the use of "redices"
except in clearly playful contexts because of the possibility of
interpretations #1, #2, #4, and especially #6, which will make it more
difficult for the listener to discover the correct derivation from
"reducible expression".  Maybe this is just me taking myself way too
seriously.

Correcting People
-----------------

If you correct people who are using "redices" (as I did), you run the
risk of a similarly hazardous gamut of responses.

1. Why is he trying to show off his knowledge?  Doesn't he know
   "redices" is a playful invention?  He must be not only arrogant but
   unskilled and unaware of it.

2. I've always heard "redices" as the plural of "redex".  Have I been
   looking dumb all these years?

Oops.

References
----------

These references are not intended to assign credit; they're just here
so you can dig deeper if you're interested.

[1] Charlton Lewis, "An Elementary Latin Dictionary", 1890, ISBN:
0199102058
> http://perseus.mpiwg-berlin.mpg.de/cgi-bin/ptext?doc=Perseus%3Atext%3A1999.04.0060%3Aentry%3D%237793

[2] Letter from Dr. Lim Chin Lam, Penang, to The Star of Malaysia
newspaper, 2006-08-25:

    It must be noted that the above nouns have been adopted (or
    borrowed or hijacked) from other languages and normally retain the
    singular and plural forms in their original language.

> http://thestar.com.my/english/story.asp?file=/2006/8/25/lifefocus/15088814&sec=lifefocus

[3] Douglas Harper's Etymonline Online Etymology Dictionary entry "ox"
> http://etymonline.com/?term=ox

[4] Jargon File 4.2, dated 2000-01-31, attributed to a large
collection of hackers but currently enclosed by Eric Raymond, section
"Jargon Construction", subsection "Overgeneralization";
> http://www.science.uva.nl/~mes/jargon/o/overgeneralization.html

[5] "Unskilled and Unaware of It: How Difficulties in Recognizing
One's Own Incompetence Lead to Inflated Self-Assessments", by Justin
Kruger and David Dunning, published in the Journal of Personality and
Social Psychology, December 1999, Vol. 77, No. 6, 1121-1134
> http://www.phule.net/mirrors/unskilled-and-unaware.html
> http://gagne.homedns.org/~tgagne/contrib/unskilled.html
> http://www.apa.org/journals/features/psp7761121.pdf

[6] "More False Latin", by John Algeo, American Speech, Vol. 41, No. 1
(Feb., 1966), pp. 72-74, doi:10.2307/453250
> http://links.jstor.org/sici?sici=0003-1283(196602)41%3A1%3C72%3AMFL%3E2.0.CO%3B2-G

[7] "An Algorithmic Approach to English Pluralization", by Damian
Conway; this describes the algorithm in Lingua::EN::Inflect.
> http://www.csse.monash.edu.au/~damian/papers/HTML/Plurals.html

Sat, 28 Apr 2007

On Thu, 2007-04-26 at 03:37 -0400, Kragen Javier Sitaker wrote:
> I've said several times that I prefer low-tech stuff, much to the
> shock of some of my co-workers; I'm not sure how to explain this
> adequately.
> 
> The more I work with computers, the more I realize that complexity has
> hidden costs --- complex things are less reliable, fail in more
> unpredictable ways, are harder to diagnose problems in, are often
> harder to fix, are usually more trouble to keep running, are harder to
> change (especially to change without breaking), and are harder to get
> to work with other things.  These alone are significant reasons to
> prefer a simpler solution to a more complex one unless the complex
> solution has significant advantages.

I think you might be overstating your case here.  Or maybe you mean some
of these things in a more narrow sense than I realize.

I often find myself frustrated with the volume controls on various kinds
of phones.  They don't go low enough, or they don't go high enough, or I
want a volume level in between two of the settings available, or the
interface for changing it sucks and is hard to use while you're actually
on the phone, which is the only time you're going to use it.  But I have
no idea what I can do about any of this.  By contrast, even though the
Gecko codebase is big and a little intimidating, I have done at least a
bit of hacking on that to disable a feature I didn't like.  So while I
won't claim that Gecko is easy to change, I think the phone is harder to
change -- because I have no idea how to fix any of these volume
problems.  It's all made of hardware that I don't understand and may be
very difficult or impossible to change.

In a similar vein, the way a lot of low-tech things are designed these
days, it seems like they're designed to be something cheap that you
don't think about replacing, instead of being something that's easy to
fix.  A lot of modern keyboards are built this way; unless you've got a
Model M or one of its descendants, if you spill something on the
keyboard, you're probably buying a new one.  My cell phone recently
broke; I didn't have the necessary skills to replace it, and Sprint
didn't seem too interested in doing it either.  I don't think this
contradicts your later points about the dangers of complex design,
though -- I'm sure Sprint was more than happy to have me instead spend a
lot of time reading their marketing materials, picking a new plan,
buying a new phone, and signing a new contract.

I realize these aren't particularly good low-tech examples.  I guess
that's partly because everything's getting more high-tech.  Probably the
lowest-tech electronic thing on my desk is an alarm clock, and even it's
pretty complicated.  Two alarm settings, plus a nap timer, with the
ability to have either buzzer or radio alarms for all of them.

Maybe I just think software's easier to change because I know how to do
it, whereas I know next to nothing about hardware.  I've never done any
soldering or anything like that.  But I think this, in turn, is partly
because it's easier to learn how to program.  The only prerequisite is a
computer, time, willingness, and dedication.  Whereas learning to mess
with hardware takes time, willingness, dedication, and a wide array of
much more specialized tools, plus space to do the work in, plus probably
some things I don't even know about.

I don't disagree with any of the ultimate points you made in your
e-mail.  I'm just not sure some of these specific points you make in
preference to low-tech are as broadly applicable as they sounded to me.

-- 
Brett Smith

I wanted to get some understanding of the alternatives available on
MercadoLibre for buying a laptop here in Argentina, so I wrote this
program to help me slice and dice it.

I wrote the program in March, but I was hoping to clean it up a bit
before posting it, since now I know some easy-to-fix usability problems
in the UI.

The program is downloadable from
http://pobox.com/~kragen/sw/laptoptable.py and you can try the DHTML
output out at http://pobox.com/~kragen/sw/laptoptable.html --- maybe less
than optimally useful.  I think it works in Firefox and Safari; it does
not work in Konqueror from KDE 3.5.5.

Like everything else posted to kragen-hacks without a notice to the
contrary, this program is in the public domain.  The images (not
included in this post) are part of Mozilla and licensed under the
relevant Mozilla licenses.

Other interesting notes:
- contains yet another RFC-822 parser.  When am I going to stop writing
  these?
- contains a tiny model of Nevow.stan, but as Daniel Martin pointed out
  in http://dtm.livejournal.com/33960.html?thread=44968#t44968 it would
  be better if the contexts in which things were found determined how
  they were escaped, rather than the things themselves determining it
  (although the things themselves need to determine whether they are
  already in HTML or would need to be escaped to become HTML).  I'm not
  sure whether this code does that or not; requires more thought.

#!/usr/bin/python
# This program gives you a dynamic query view on a small amount of
# tabular data --- a page or two.

# UI improvements and bugs to fix for current functionality:
# - It breaks the Back button.
# - Be smarter about composing impossible or useless criteria.  If
#   there are multiple criteria on the same column, the user can
#   select the wrong one, and changing an "is at least" criterion is
#   unnecessarily difficult; it probably would be better to unify them
#   into a single criterion for the column.  It's also really easy to
#   put yourself in a situation where there are no visible rows, which
#   can make it hard to tell what's going on.  Here's what I'm
#   thinking:
#   - adding a new value to an "is" criterion: already not possible.
#   - adding a new value to an "is at least" or "is at most"
#     criterion: make the criterion into a range criterion, "is
#     between X and Y".
#   - adding a new value to an "is between X and Y" criterion: make
#     it into an "is" criterion, I suppose.  If we do enough with the
#     URL, maybe we could allow undo with the back button.
#   - clicking on an empty value: I'm not sure what to do in this
#     case.  Maybe change "is" to "is empty or" and the like?
#   - add a drop-down to add new values to an "is" criterion.
#   - the above also suggests that you should have a drop-down option
#     from "between" to "is at least" top of range or "is at most"
#     bottom.  Maybe still display as two separate criteria with
#     separate drop-downs.
#   - also make it possible to select most of these criteria from a
#     popup menu on the table cell.
# - Change "is empty" to a kind of relationship.
#   - which will avoid "is empty or empty".
#   - also we want "is not empty".
# - Use animation for every change.  It's hard to tell what's
#   happening when your screen all changes at once; it's better for
#   menus to fade in or pop up with animations, rows and columns to
#   shrink down to nothing or grow up from nothing, perhaps even for
#   sorting to move the rows up and down.  Also, an animation moving
#   from the clicked point to the descriptive paragraph up top could
#   help with alerting the user that the paragraph has changed --- or
#   at least highlight the new element.  However, all of this has to
#   be very quick, sub-100ms, so 5 frames of animation at most.
# - Be smarter about composing relationship menus: a single-click on
#   the relationship name should execute a likely change, to be undone
#   with another single-click, and if you're in an "empty or" state,
#   the normal options below should respect that.  Likely changes
#   might be:
#   - "is" to "is not" or "is empty or"
#   - "is at least" to "is at most"
#   - "is between" to "is not between"?
#   - "is one of" to "is not one of"?
# - Also, it may be that menu items should be bigger, especially
#   those farther off.
# - Position the relationship menu correctly; right now the next menu
#   item creeps up onto the bottom edge of the current selection.
# - Adjust the size of the underlying relationship display when the
#   menu pops up, so that the menu doesn't obscure the quantity being
#   compared when the relationship is "is".
# - Keep the paragraphs at the top and the table headers from
#   scrolling off the screen.  (How does TrimSpreadsheet do it?)
# - Better graphics!  Check openclipart.org.
# - Change sorting to use an algorithm that works on both numbers and
#   non-numbers, instead of choosing per column.  The usual approach
#   is to split each string into alternating numeric and non-numeric
#   fields, with a specified one of them always coming first, then
#   compare the resulting tuples lexicographically.  This sucks on
#   hexadecimal numbers but is better the rest of the time.

# More functionality to add:
# - count rows
# - substring text search --- by default across all fields, but
#   changeable to be field-specific.
# - "is one of" and "is not one of" relationships
# - "empty or" for < and > comparisons
# - dividing into sections according to a field (with a popup menu on
#   the table headers similar to the one on the relationship menu)
# - editing and adding values (perhaps with a popup menu on the table
#   cells, and a + at the right edge of the headers?)
# - adding per-section aggregate functions: min, max, mean, mode,
#   sum, count (again, on the table-header popup menu)
# - relational factoring (a la DabbleDB)
# - Excel import (a la Jot and Dabble)
# - named views (bookmarking URLs is not enough)
# - persistence of data (see how TiddlyWiki does it; maybe also Dojo
#   Storage)
# - persistence of views (maybe just in the form of persistence of URLs)
# - move all the Python code into JavaScript
# - DabbleDB-style "summary" column showing all the columns not
#   explicitly in the view
# - "leftovers" column to display mostly-null data (very similar idea)
# - handling larger data sets
#   - for example, it would be cool to have an oerlap-like overview,
#     per-section or for the whole table: most common three values
#     for each field and their number of occurrences
#   - and of course you have to be a little lazier
#   - and maybe you could even use some materialized views to speed
#     OLAPpy things along a bit.
# - "is more than" and "is less than" comparisons

import StringIO, sys

def rfc822parse(lines):
    "Parse a file of short RFC-822 header blocks separated by blank lines."
    rv = {}
    lastkey = None
    for line in lines:
        line = line.strip()
        if not line:
            if rv: yield rv
            rv = {}
        elif line[0] in ' \t':
            rv[lastkey] += ' ' + line.strip()
        else:
            lastkey, val = line.split(':', 1)
            rv[lastkey] = val.strip()

### HTML production, modeled after my memory of Nevow.

def escape(astring):
    "Remove special HTML characters from astring."
    return (astring.replace('&', '&amp;')
                   .replace('<', '&lt;')
                   .replace('"', '&quot;'))

def as_html(something):
    "Render an object for inclusion in HTML, using its as_html if possible."
    if isinstance(something, type([])): return ''.join(map(as_html, something))
    try: return something.as_html()
    except AttributeError: return escape(str(something))

class html_element:
    "A class for producing HTML elements."
    def __init__(self, name, attrs=None, contents=None):
        self.name = name
        self.attrs = attrs or {}
        self.contents = contents
    def as_html(self):
        tagstr = self.name + ''.join([' %s="%s"' % (key, as_html(val)) 
                                      for key, val in self.attrs.items()])
        if self.contents is None:
            return '<%s />' % tagstr
        else:
            return '<%s>%s</%s>' % (tagstr, self.contents, self.name)
    def __str__(self): return self.as_html()
    def content_transform(self, content): return as_html(content)
    def __getitem__(self, contents):
        if contents is None: return self.clone(contents=None)
        if not isinstance(contents, type(())): contents = (contents,)
        return self.clone(contents=''.join(map(self.content_transform, 
                                               contents)))
    def __call__(self, **attrs):
        nattrs = {}
        for k, v in attrs.items():
            if k.startswith('_'): k = k[1:]
            nattrs[k] = v
        return self.clone(attrs=nattrs)
    def clone(self, attrs=None, contents=None):
        if not attrs: attrs = {}
        attrs.update(self.attrs)
        return self.__class__(self.name, attrs, contents)

class cdata_content_html_element(html_element):
    "A subclass for HTML elements like <script> with CDATA content model."
    def content_transform(self, content): return str(content)

# Import a bunch of HTML into the global namespace
for tag in 'table tr td th head title body p span div h1 h2 a img style'.split():
    globals()[tag] = html_element(tag)
script = cdata_content_html_element('script')

class _html:
    "An easy way to get html_element instances: e.g. print html.div."
    def __getattr__(self, name): return html_element(name)
html = _html()

# I found some images I could use like this:
# find ~/ -name '*.png' -size -2048c | while read fname; do echo '<img src="'"$fname"'" />'; done > images.html

delete_image = "images/table-remove-column.gif"

def maketable(afile, munge = lambda x: None):
    "Parse a file with rfc822parse and return an HTML table of its contents."
    records = list(rfc822parse(afile))
    allkeys = {}
    for rec in records: munge(rec)
    for rec in records:
        for key in rec.keys(): allkeys[key] = 1
    keys = allkeys.keys()
    # These three images come from Firefox:
    images = lambda colname: (
        html.nobr[" ",
                  img(src="images/table-add-row-before-active.gif", 
                      _class="sorted_up", 
                      title="sorted ascending by %s" % colname),
                  img(src="images/table-add-row-after-active.gif",
                      _class="sorted_down", 
                      title="sorted descending by %s" % colname),
                  img(src=delete_image,
                      _class="hide_column", title="hide %s" % colname)])
    headers = tr[[th(title="sort by %s" % key)[key, images(key)] 
                  for key in keys]]
    rows = [tr[[td[rec.get(key)] for key in keys], "\n"] for rec in records]
    # Set cellspacing to 0 to avoid lost mouse clicks.
    return table(cellspacing=0, cellpadding=0)[headers, rows]

css = """
* { letter-spacing: 0.08em; font-size: 96% }
th, td { 
    background: url(images/shadowTR.png);  /* from Dojo */
    background-repeat: no-repeat;
    background-position: bottom left;
    padding: 2px;
    text-align: right;
}
th:hover, td:hover { color: #00f }
img.hide_column, img.delete_filter { padding: 2px }
img.hide_column:hover, img.delete_filter:hover { background-color: #f77 }
#menu div, span.pulldown { border: 1px solid #ddf; padding: 2px }
span.pulldown { margin: 2px }
#menu div:hover, span.pulldown:hover { background-color: #ffd }
#menu { position: absolute; left: 200px; top: 200px; display: none }
/* default background-color is transparent, so we use white. */
/* margin-top: -1px is to make the 1px borders not be double-width. */
#menu div { text-align: center; margin-top: -1px; background-color: white }
"""

allcolumns = "All columns shown."
allrows = "All rows shown."

javascript = """

// ======================================== basic functional stuff
function forEach(list, fun) {
    for (var ii = 0; ii < list.length; ii++) fun(list[ii])
}
// Is item an element of list?
function element(item, list) {
    if (list.length == null) throw "attempt to find element in non-list"
    for (var ii = 0; ii < list.length; ii++) if (item == list[ii]) return true
    return false
}
// Return all items for which fun returns true.
function filter(fun, list) {
    var rv = []
    for (var ii = 0; ii < list.length; ii++) 
        if (fun(list[ii])) rv.push(list[ii])
    return rv
}
// Transform items in a list with fun.
function map(fun, list) {
    var rv = []
    for (var ii = 0; ii < list.length; ii++) rv.push(fun(list[ii]))
    return rv
}
// Copy a list-like thing so that it becomes a real list.
function copylist(list) { return filter(function(){return true}, list) }
// Are all the booleans in a boolean list true?
function all(alist) {
    for (var ii = 0; ii < alist.length; ii++)
        if (!alist[ii]) return false
    return true
}
function not_null(value) { return value != null }
// Set several attributes on an object.
function set_attributes(obj, attrs) {
    for (var attr in attrs) obj[attr] = attrs[attr]
}
// Construct a closure to call a specified method.
function method() {  // hmm, this went from 1 line to 5 to handle arguments...
    var args = copylist(arguments)
    var name = args.splice(0, 1)
    return function(obj) { return obj[name].apply(obj, args) } 
}

// ======================================== basic DOM stuff

function remove_node(node) { node.parentNode.removeChild(node) }
function classes(element) {
    return filter(function(x) { return x != ''}, element.className.split(/\s+/))
}
// Not in any particularly nice order.
function descendants(element) {
    var rv = []
    // Using an explicit stack because stupid Firefox was
    // saying "too much recursion".  This comment will be embarrassing
    // if the bug was in my recursive code.
    // function get_descendants(element, list) {
    //     forEach(element.childNodes, function(x) { 
    //         list.push(x)
    //         get_descendants(element, list)
    //     })
    // }
    var stack = [element]
    while (stack.length) {
        var elem = stack.pop()
        rv.push(elem)
        forEach(elem.childNodes, function(x) { stack.push(x) })
    }
    return rv
}
function getDescendantsByClassName(elem, klass) {
    return filter(function(e) { 
        return e.nodeType == e.ELEMENT_NODE && element(klass, classes(e))
    }, descendants(elem))
}
// find an ancestor with a specified tag.
function ancestorOfType(elem, tagname) {
    while (elem && elem.tagName != tagname) elem = elem.parentNode
    return elem
}
function $(id) { return document.getElementById(id) }
function txt(text) { return document.createTextNode(text) }
function unhide(elem) {elem.style.display = ''}
function hide(elem) {elem.style.display = 'none'}

// ======================================== basic utility stuff

function assert(condition, reason) {
    if (!condition) {
        assert_fail_reason = reason
        throw "assertion failure: " + reason
    }
}

// Get numeric value for sorting.
// Note: empty strings are "numeric".  Returns 0 for them, because NaN
// seems to fuck up sorting.
// Also allow leading $ and trailing ? or %%.  (Note percent must be doubled
// inside Python string interpolation.)
function numeric_value_of(val) {
    var match = val.match(/^\$?(\d*(?:\.\d+)?)\??%%?$/)
    if (!match) return null
    var rv = parseFloat(match[1])
    // there must be a better way to detect NaN!
    if ('' + rv == 'NaN') return 0
    return rv
}

// ======================================== Column

// Represents a column of the table.
function Column(table_element, column_number) {
    assert(table_element.tagName == 'TABLE', 'no tbody?')
    this.table = table_element
    this.colnum = column_number
    this.title = this.cells()[0].firstChild.textContent
}
// All the table cell DOM objects in the column.
Column.prototype.cells = function column_cells() {
    var rv = []
    var column_num = this.colnum
    var self = this
    forEach(this.table_rows(), function(row) {
        var cell = self.row_cell(row)
        if (cell) rv.push(cell)
    })
    return rv
}
// All the table cell DOM objects in the column except the header.
Column.prototype.data_cells = function column_data_cells() {
    var rv = this.cells()
    rv.splice(0, 1)
    return rv
}
// All the table row DOM objects in the table.
Column.prototype.table_rows = function column_table_rows() {
    return this.table.getElementsByTagName('tr')
}
// All the table row DOM objects in the table except the headers.
Column.prototype.data_rows = function column_data_rows() {
    var rv = copylist(this.table_rows())
    rv.splice(0, 1)  // skip header row
    return rv
}
// Given a table row DOM object, returns the cell DOM object for this column.
Column.prototype.row_cell = function column_row_cell(row) {
    assert(row.tagName == 'TR', "row")
    var ii = 0
    var cell = row.firstChild
    while (cell != null && ii < this.colnum) {
        assert(cell.tagName == 'TD' || cell.tagName == 'TH', "cell")
        ii++
        cell = cell.nextSibling
    }
    return cell
}
// Describe the column as a textual DOM object for when it's hidden.
Column.prototype.describe_hidden = function describe_hidden() {
    var a = document.createElement("a")
    // XXX hope column name is safe for interpolation...
    a.href = "javascript:unhide_column('" + this.title + "')"
    a.appendChild(txt(this.title))
    return a
}
// Return cell contents as strings.
Column.prototype.values = function column_values() {
    var self = this
    return map(function(cell) { return cell.textContent }, this.data_cells())
}
// Return true if all (data) cell contents are numeric.
Column.prototype.is_numeric = function column_is_numeric() {
    if (this._is_numeric != null) return this._is_numeric
    this._is_numeric = all(map(not_null, map(numeric_value_of, this.values())))
    return this._is_numeric
}
// Return a version of "val" suitable for comparisons, either as string or num.
Column.prototype.data_value = function column_data_value(val) {
    if (this.is_numeric()) return numeric_value_of(val)
    else return val
}
// Return cell contents either as strings or as numbers, for comparisons.
Column.prototype.data_values = function column_data_values() {
    var self = this
    return map(function(val){return self.data_value(val)}, this.values())
}
Column.prototype.readable_value = function column_readable_value(value) {
    if (value == '') return "empty"
    if (this.is_numeric()) return value
    return "'" + value + "'"
}
// Sort the rows in the table.
// Here's yet another JavaScript function that would be simpler with
// NumPy-like array operations.
Column.prototype.sort_rows = function sort_rows(reverse) {
    // Values is a two-column table, rather than two arrays, so
    // that .sort() will sort both columns.
    var data_values = this.data_values()
    var values = []
    // Um, DUH.  I'm already providing a comparator function.  Why
    // don't I just FETCH THE DATA VALUE IN THE COMPARATOR FUNCTION?
    // XXX THIS IS STUPID.  THE SCHWARTZIAN TRANSFORM IS FOR WHEN 
    // FETCHING THE KEY IS EXPENSIVE.
    for (var ii = 0; ii < data_values.length; ii++) {
        values.push([ii, data_values[ii]])
    }

    var up = reverse ? -1 : 1
    values.sort(function(a, b) {
        return (a[1] < b[1]) ? -up : (a[1] > b[1]) ? up
             // Fall back on previous row ordering for equal keys
             : (a[0] < b[0]) ? -1  : (a[0] > b[0]) ? 1 : 0
    })
    // dump original table contents and reinsert sorted
    var rows = this.data_rows()
    assert(rows.length == values.length, 
        "length wrong: " + rows.length + " " + values.length)
    var headers = this.table_rows()[0]
    while (headers.nextSibling) remove_node(headers.nextSibling)
    forEach(values, function(val) {
        assert(val[0] >= 0, "index too low: " + val[0])
        assert(val[0] < rows.length, "index too high: " + val[0])
        headers.parentNode.appendChild(rows[val[0]])
    })
}


// XXX neither find_column nor row_cell handles colspan or rowspan.

// Given a DOM object somewhere within a column, construct a column object.
function find_column(element) {
    var cell = ancestorOfType(element, 'TH') || ancestorOfType(element, 'TD')
    var row = cell.parentNode
    assert(row.tagName == 'TR', 'row tagname: ' + row.tagName)
    var colnum = 0
    var seeker = row.firstChild
    while (seeker != cell) {
        assert(seeker.tagName == 'TD' || seeker.tagName == 'TH', 'cell tag')
        colnum++
        seeker = seeker.nextSibling
    }
    return new Column(row.parentNode.parentNode, colnum)
}

// ======================================== common but less general DOM stuff

// Append a bunch of textual DOM nodes to an element to form a textual list.
// Inserts commas and " and " as appropriate.
function append_comma_separated_list(elem, items) {
    for (var ii = 0; ii < items.length; ii++) {
        elem.appendChild(items[ii])
        if (items.length > 2 && ii < items.length - 1)
            elem.appendChild(txt(", "))
        else if (ii < items.length - 1) elem.appendChild(txt(" "))
        if (ii == items.length - 2) elem.appendChild(txt("and "))
    }
}

// Wrap a DOM event handler in a delaying function --- provides some
// instant feedback to the user that their click registered, then runs
// the "bottom-half" handler from a setTimeout of 0.  There's an
// optional top_half to do anything that needs to be done to the event
// immediately (e.g. preventDefault, or save its currentTarget.)

function delay_dom_handler(bottom_half, top_half) {
    return function(ev) {
        var original_background = ev.target.style.background
        ev.target.style.background = 'red'
        var top_half_rv
        if (top_half) top_half_rv = top_half(ev)
        var bottom_half_wrapped = function() {
            ev.target.style.background = original_background
            bottom_half(ev, top_half_rv)
        }
        setTimeout(bottom_half_wrapped, 0)
    }
}

// ======================================== hiding columns

hidden_columns = []

// Update the text describing the hidden columns.
function update_hidden_columns_text() {
    var p = $('hidden_columns')
    if (hidden_columns.length == 0) {
        p.innerHTML = "%(allcolumns)s"
    } else {
        p.innerHTML = "Showing all columns except for "
        append_comma_separated_list(p, 
            map(method('describe_hidden'), hidden_columns))
        p.appendChild(txt("."))
    }
}

// Called both from javascript: urls and from code, thus the funny arg.
function unhide_column(hidden_column_name) {
    var idx = null
    for (var ii = 0; ii < hidden_columns.length; ii++) {
        if (hidden_columns[ii].title == hidden_column_name) { idx=ii; break }
    }
    if (idx == null) return
    var column = hidden_columns.splice(idx, 1)
    update_hidden_columns_text()
    forEach(column[0].cells(), unhide)
}

function hide_column(column) {
    if (element(column, hidden_columns)) return
    hidden_columns.push(column)
    update_hidden_columns_text()
    forEach(column.cells(), hide)
}

// when the user clicks on the 'hide column' button
function _hide_column_handler(ev) {
    hide_column(find_column(ev.target))
}
hide_column_handler = delay_dom_handler(_hide_column_handler, 
    method('stopPropagation'))

// ======================================== sorting

// Keep track of the current image indicating "sorted by" so we can hide it
sorted_by_img = null
function set_sorted_by_img(new_img) {
    if (sorted_by_img) hide(sorted_by_img)
    sorted_by_img = new_img
    unhide(sorted_by_img)
}

// when the user clicks on a column header, sort by that column
sort_column_hdr = null
sort_order = null
function _sort_column(ev, th) {
    assert(th.tagName == 'TH', th.tagName)
    // I almost always want things sorted descending, so that's the default.
    sort_order = (sort_column_hdr == th && sort_order == 'sorted_down')
        ? 'sorted_up'
        : 'sorted_down'
    sort_column_hdr = th
    forEach(getDescendantsByClassName(th, sort_order), set_sorted_by_img)
    find_column(th).sort_rows(sort_order == 'sorted_down')
}
sort_column = delay_dom_handler(_sort_column, 
    function(ev) { return ev.currentTarget })

// ======================================== filtering

// ways to compare
comparison_functions = {
    ' is ': function(n, a, b) { return a == b },
    ' is empty or ': function(n, a, b) { return n || a == b },
    ' is at least ': function(n, a, b) { return a >= b },
    ' is at most ': function(n, a, b) { return a <= b },
}

// Filter represents a single filtering criterion.
function Filter(column, value) {
    this.column = column
    this.value = value
    this.relationship = ' is '
}
// Construct DOM textual description of the filter.
Filter.prototype.description = function filter_description() {
    var rv = document.createElement('span')

    var delete_criterion_button = document.createElement('img')
    set_attributes(delete_criterion_button, {
        src: "%(delete_image)s",
        title: "remove this filter",
        className: "delete_filter",
    })
    delete_criterion_button.addEventListener('click', unfilter(this), false)
    rv.appendChild(delete_criterion_button)

    rv.appendChild(txt(this.column.title))

    var span = document.createElement('span')
    set_attributes(span, {
        className: "pulldown",
        title: "change the relationship being tested",
    })
    span.appendChild(txt(this.relationship))
    span.addEventListener('mousedown', pop_up_menu(this, span), true)
    rv.appendChild(span)

    rv.appendChild(txt(this.column.readable_value(this.value)))
    return rv
}
// Change the kind of criterion --- how we compare sample value to table data
Filter.prototype.change_relationship = function filter_change_relationship(r) {
    // If we're changing between 'is' and more complicated relationships,
    // we may want to hide or unhide the relevant column.
    var columnhiding_relationship = ' is '
    var hide_unhide = ((this.relationship == columnhiding_relationship)
                                    != (r == columnhiding_relationship))
    this.relationship = r
    if (hide_unhide) {
        if (r == columnhiding_relationship) hide_column(this.column)
        else unhide_column(this.column.title)
    }
}
// Return true if this filter would allow a row to be shown.
Filter.prototype.matches = function filter_matches(row) {
    var cell_contents = this.column.row_cell(row).textContent
    var is_null = cell_contents == ''
    return comparison_functions[this.relationship](
        is_null,
        this.column.data_value(cell_contents),
        this.column.data_value(this.value)
    )
}
Filter.prototype.equals = function filter_equals(other) {
    return this.column == other.column && this.value == other.value
}

filters = []

// Filter list updated; update textual description and displayed rows.
function update_filter_display() {
    // Update textual description of filter list.
    if (filters.length == 0) {
        $('row_filter').innerHTML = "%(allrows)s"
    } else {
        var p = $('row_filter')
        p.innerHTML = 'Showing only rows where '
        var filter_descriptions = map(method('description'), filters)
        append_comma_separated_list(p, filter_descriptions)
        p.appendChild(txt("."))
    }

    // Show all those rows that make it through the filters; hide the rest.
    forEach(all_data_rows, function(row) {
        (all(map(method('matches', row), filters)) ? unhide : hide)(row)
    })
}

filter_using_menu = null
function _menu_mouseup(ev) {
    $('menu').style.display = ''  // default back to 'none' from the CSS
    document.removeEventListener('mouseup', menu_mouseup, true)
    if (element(ev.target, $('menu').childNodes)) {
        filter_using_menu.change_relationship(ev.target.textContent)
        update_filter_display()
    }
    filter_using_menu = null
}
menu_mouseup = delay_dom_handler(_menu_mouseup, function(ev) {
    ev.stopPropagation()
    ev.preventDefault()
})

function pop_up_menu(filter, element) {
    return delay_dom_handler(function pop_up_handler(ev) {
        document.addEventListener('mouseup', menu_mouseup, true)
        filter_using_menu = filter
        $('menu').style.display = 'block'
        $('menu').style.left = element.offsetLeft + 'px'
        // XXX 1 is a stupid fudge factor.  We subtract
        // element.offsetHeight to make the second menu line align.
        $('menu').style.top = (element.offsetTop - element.offsetHeight + 1) + 'px'
        // XXX why doesn't this work?!  I need to go back to Remedial CSS!
        element.style.width = $('menu').offsetWidth + 'px !important'
    }, method('preventDefault'))
}

// Construct an event handler for a filter criterion deletion button.
function unfilter(filter) {
    // Delete a filter criterion when the user clicks on the button to do so.
    return delay_dom_handler(function unfilter_handler(ev) {
        var filter_index
        for (var ii = 0; ii < filters.length; ii++) {
            if (filters[ii].equals(filter)) {
                filter_index = ii
                break
            }
        }
        var deleted_filters = filters.splice(filter_index, 1)
        update_filter_display()
        unhide_column(deleted_filters[0].column.title)
    })
}

// what to do when the user clicks on a table cell
// XXX this variable is a kludge.  the hope is that it will be
// initialized by the time someone calls update_filter_display.
// Better to initialize it on page load.
all_data_rows = null
function _filter_by_example(ev) {
    if (ev.target.tagName == 'A' && ev.target.href) return
    var cell = ancestorOfType(ev.target, 'TD')
    var column = find_column(cell)
    filters.push(new Filter(column, cell.textContent))
    if (!all_data_rows) all_data_rows = column.data_rows()
    update_filter_display()
    hide_column(column)
}
filter_by_example = delay_dom_handler(_filter_by_example)

// ======================================== application setup

// what to do on load
function myloader(ev) {
    forEach(document.getElementsByTagName('TH'), function(elem) {
        elem.addEventListener('click', sort_column, false)
        forEach(getDescendantsByClassName(elem, 'sorted_up'), function(e) {
            e.style.display = 'none'
        })
        forEach(getDescendantsByClassName(elem, 'sorted_down'), function(e) {
            e.style.display = 'none'
        })
        forEach(getDescendantsByClassName(elem, 'hide_column'), function(e) {
            e.addEventListener('click', hide_column_handler, false)
        })
        var col = find_column(elem)
        forEach(col.data_cells(), function(cell) {
            cell.addEventListener('click', filter_by_example, false)
            cell.title = "filter by " + col.readable_value(cell.textContent) + " in " + col.title
            forEach(cell.getElementsByTagName('a'), function(link) {
                /* don't display "filter by link in url" on link mouseover */
                if (!link.href) return
                if (link.title) return
                link.title = link.href
            })
        })
    })
}
window.addEventListener('load', myloader, true)
""" % { 
    'allcolumns': allcolumns, 
    'allrows': allrows,
    'delete_image': delete_image,
}

def makehtml(afile, munge=lambda x: None):
    "Make an HTML file from a bunch of RFC-822 headers."
    menu = div(id="menu")[div[" is empty or "],
                          div[" is "], 
                          div[" is at least "], 
                          div[" is at most "]]
    return html.html[head[title["laptoptable table for %s" % afile], "\n",
                          style(type="text/css")[css], "\n",
                          script(type="text/javascript")[javascript]], "\n",
                     body[h1[afile], "\n", menu,
                          p[span(id="row_filter")[allrows], "\n",
                            span(id="hidden_columns")[allcolumns]], "\n",
                          maketable(afile, munge=munge)]]

def laptop_stuff(rec):
    "Stuff specific to my laptop application."
    exchange_rage = 3.12
    currency = lambda x: '$%.2f' % float(x)
    if rec.has_key('dollars') and not rec.has_key('pesos'): 
        rec['pesos'] = float(rec['dollars']) * exchange_rage
    elif rec.has_key('pesos') and not rec.has_key('dollars'):
        rec['dollars'] = float(rec['pesos']) / exchange_rage
    if rec.has_key('dollars'):
        rec['pesos'] = currency(rec['pesos'])
        rec['dollars'] = currency(rec['dollars'])
    if rec.has_key('url'):
        rec['url'] = a(href=rec['url'])['link']

sample_laptops_file = """
url: http://articulo.mercadolibre.com.ar/MLA-26314670-ibm-thinkpad-600x-con-dvdcd-e-infrarojo-_JM
ram-mb: 192
disk-gb: 12
clock-mhz: 500
battery: broken
pesos: 1490
kilograms: 1.95

url: http://articulo.mercadolibre.com.ar/MLA-26161714-compaq-presario-1200-colegio-_JM
pesos: 1500
ram-mb: 188
disk-gb: 6
battery: working

url: http://articulo.mercadolibre.com.ar/MLA-26310670-_JM
pesos: 1500
ram-mb: 192
disk-gb: 30
clock-mhz: 650
video: 1024x768
battery: working

url: http://articulo.mercadolibre.com.ar/MLA-26336437-notebook-ibm-pentium-ll-64mg-ram-_JM
pesos: 1500
ram-mb: 64
disk-gb: 6
clock-mhz: 400

url: http://articulo.mercadolibre.com.ar/MLA-26257874-notebook-compaq-presario-700la-_JM
pesos: 1500
ram-mb: 256
disk-gb: 20
clock-mhz: 900
battery: semi-working
video: 1024x768
vendor: since 2003, no points

url: http://articulo.mercadolibre.com.ar/MLA-26061407-acer-travelmatte-508t-celeron-500-64mb-11gb-eze-_JM
pesos: 1500
ram-mb: 64
disk-gb: 11
clock-mhz: 500
video: 800x600
battery: 1 hour

url: http://articulo.mercadolibre.com.ar/MLA-25337380-toshiba-satellite-2800-s201-_JM
dollars: 499
photo: none
ram-mb: 128
disk-gb: 30
clock-mhz: 650
stock: out

url: http://articulo.mercadolibre.com.ar/MLA-25836885-notebook-p3-700-mhz-256-ram-40gb-cddvd-envio-gratis-_JM
dollars: 499.98
ram-mb: 256
disk-gb: 40
clock-mhz: 650
battery: working
guarantee: 5 days
video: 1024x768

url: http://articulo.mercadolibre.com.ar/MLA-25380508-ibm-thinpad-t21-3-meses-de-garanta-_JM
dollars: 630
clock-mhz: 900

url: http://articulo.mercadolibre.com.ar/MLA-26297077-notebook-dell-latitude-cpxj650ctgarantia24-meses-_JM
dollars: 625
clock-mhz: 600
disk-gb: 12.5
ram-mb: 128
video: 1024x768
battery: 4 hours
guarantee: 24 months

url: http://articulo.mercadolibre.com.ar/MLA-25662223-dell-pentium-3-1000256-20gb-impecalk-1990-_JM
pesos: 1990
clock-mhz: 1000
ram-mb: 256
disk-gb: 20
battery: 2 hours

url: http://articulo.mercadolibre.com.ar/MLA-26229000-_JM
pesos: 1500
clock-mhz: 33?
ram-mb: 12
video: 800x600

url: http://articulo.mercadolibre.com.ar/MLA-26320681-notebook-toshiba-satellite-4600-pro-liquido-ya-_JM
pesos: 1450
clock-mhz: 800
ram-mb: 128
disk-gb: 18.6
video: 1024x768
battery: 2 hours

url: http://articulo.mercadolibre.com.ar/MLA-25950089-_JM
pesos: 1350
clock-mhz: 700
ram-mb: 128
disk-gb: 12
battery: dead

url: http://articulo.mercadolibre.com.ar/MLA-26202298-laptop-compaq-presario-700-1300-_JM
pesos: 1150
clock-mhz: 500
ram-mb: 256
disk-gb: 20
battery: 0.25 hours
vendor: no points

url: http://articulo.mercadolibre.com.ar/MLA-26083337-notebook-toshiba-tecra-8000-pentium-ii-450-mhs-o-permuto-_JM
pesos: 1290
clock-mhz: 450
ram-mb: 64
disk-gb: 3.8
battery: baja
model: Tecra 8000

url: http://articulo.mercadolibre.com.ar/MLA-26366998-compaq-armada-e500-en-muy-buen-estado-_JM
pesos: 1250
clock-mhz: 700
ram-mb: 256
disk-gb: 12
battery: broken

url: http://articulo.mercadolibre.com.ar/MLA-26228657-ibm-thinkpad-i-series-1200-en-caja-batera-nueva-_JM
dollars: 389.99
clock-mhz: 500
ram-mb: 160
disk-gb: 6
battery: 2 hours
guarantee: 3 meses

url: http://articulo.mercadolibre.com.ar/MLA-26301698-compaq-presario-1200-_JM
pesos: 1200
ram-mb: 60
disk-gb: 6.1

url: http://articulo.mercadolibre.com.ar/MLA-26328755-_JM
pesos: 1200
clock-mhz: 500
ram-mb: 256
disk-gb: 10
battery: broken

url: http://articulo.mercadolibre.com.ar/MLA-26342257-_JM
pesos: 1880
clock-mhz: 900
ram-mb: 128
disk-gb: 20
battery: broken
guarantee: 3 months
vendor: has 5% negatives on 312 points

url: http://articulo.mercadolibre.com.ar/MLA-25931352-notebook-amd-duron-950mhz-256ram-20gb-dvdcd-envio-gratis-_JM
dollars: 564.98
clock-mhz: 950
ram-mb: 256
disk-gb: 20
battery: 2 hours

url: http://articulo.mercadolibre.com.ar/MLA-25819694-notebook-p3-1ghz-256ram-20gb-cd-rw-envio-gratis-_JM
dollars: 549.98
clock-mhz: 1000
ram-mb: 256
disk-gb: 20
battery: 2 hours

url: http://articulo.mercadolibre.com.ar/MLA-25819756-notebook-p3-700mhz-256-ram-30gb-dvd-envio-gratis-_JM
dollars: 539.98
clock-mhz: 700
ram-mb: 256
disk-gb: 30
battery: 2 hours

url: http://articulo.mercadolibre.com.ar/MLA-26051957-_JM
dollars: 530
clock-mhz: 800
ram-mb: 128
disk-gb: 20
battery: 2 hours
video: 1024x768?
model: T21

url: http://articulo.mercadolibre.com.ar/MLA-26328242-_JM
pesos: 1600
clock-mhz: 650
ram-mb: 512

url: http://articulo.mercadolibre.com.ar/MLA-26387632-notebook-pentium-iii-nec-400-mhz-wireless-super-delgada-_JM
pesos: 1550
clock-mhz: 400
ram-mb: 192
disk-gb: 20
battery: 4 hours
video: 1024x768

url: http://articulo.mercadolibre.com.ar/MLA-26227760-notebook-compaq-armada-e500-iii900-256-ram-hd-20-g-143-_JM
dollars: 499.99
clock-mhz: 900
ram-mb: 256
disk-gb: 20
video: 1280x1024

"""

if __name__ == "__main__":
    if len(sys.argv) == 1: infile = StringIO.StringIO(sample_laptops_file)
    else: infile = file(sys.argv[1])
    print makehtml(infile, munge=laptop_stuff)

Thu, 26 Apr 2007

I originally wrote this as part of a kragen-journal post in 2001; this
is a lightly edited version of part of
http://lists.canonical.org/pipermail/kragen-journal/2001-February/000475.html

I've said several times that I prefer low-tech stuff, much to the
shock of some of my co-workers; I'm not sure how to explain this
adequately.

The more I work with computers, the more I realize that complexity has
hidden costs --- complex things are less reliable, fail in more
unpredictable ways, are harder to diagnose problems in, are often
harder to fix, are usually more trouble to keep running, are harder to
change (especially to change without breaking), and are harder to get
to work with other things.  These alone are significant reasons to
prefer a simpler solution to a more complex one unless the complex
solution has significant advantages.

But there's a nastier side to it, too.  Complex things hide the
intentions of their makers better.  There's an interesting power
relationship set up between a craftsman and the users of his work
products: the work products are subject to the whim of the users, but
also, the users are subject to the whims of the work products.

This is a given in architecture: our mental state and behavior are
powerfully affected by our surroundings; by manipulating people's
surroundings, we can control what is easy and hard for them to do, and
therefore, what they will do.

So when you're interacting with an artifact, your control over your
environment is directly limited by the extent of your control over
that artifact.  To the extent that you don't control the artifact, but
are controlled by it, you are under the control of the artifact's
creator.

For example, I watched my ex-girlfriend Jamie deleting her spam one day.
She used Outlook Express, which displayed each message as soon as she
highlighted it in the message index.  Because nearly all of her spam was
HTML spam, it transmitted back signals to its senders (via "web bugs")
that indicate that she had read it, and therefore that her address was a
valid, deliverable address --- simply because she had highlighted it in
order to delete it.

I suggested to her that she should not view the spam before deleting
it.  Unfortunately, neither of us knew how to tell Outlook Express not
to view a message when it was highlighted in the index, nor how to
delete a single message without highlighting it; our ignorance made us
powerless and subjugated her to the designs of the Microsoft engineers
who wrote the program --- and, by extension, to the mentally crippled
morality-disabled vomit-lapping spammers who exploit its careless
intrusion on her privacy.

This is a fairly innocent example; the only negative consequence is
that her email address became less useful as time goes on, making it
impossible for her to maintain very-infrequent email correspondences
as an inadvertent result of a UI design choice.  Not every harmful
design choice is so innocent --- case in point: the version-to-version
minor incompatibilities of Microsoft Office programs, which force you
to upgrade your whole office, and which eventually render old
documents unreadable --- and not every harmful design choice has such
limited effects.

As a result of many experiences like this, I do my best to use
software that is as simple as possible.  Maybe I'm too paranoid.  Or
maybe I'm just sensible, while many other people are blinded by the
coolness of the stuff they're using.  Only time and comp.risks will
tell.

I'd really like to work in an operating environment simple enough that
I could actually read and understand every line of code, and flexible
enough that I could easily change whatever I didn't like.  It seems
that it should be possible to build a mailreader, web browser, HTML
editor, web server, GUI, and preemptively-multitasking OS, all in
100 000 lines of code or so.  Viewpoints Research Institute has just
gotten NSF funding to try to do this in 20 000 lines of code, but as I
understand it, they are going to skip the web-browser part because they
don't like the architecture of the web.

I argued once that GUI programming is inherently harder than text-mode
programming.  Derek Robinson disagreed with me; he said that
programming with the DOM in MSIE5, which one could certainly conceive
as a kind of GUI programming, was far easier than programming with any
GUI toolkit, or even in text mode, than anything he'd ever done
before.  I think he's right.

Mon, 23 Apr 2007

I've just been importing the change history for the Bicicleta project
(stored as a series of .tar.gz source tree snapshots, stone-age-style)
into darcs.  Often I've claimed that darcs is nice because it keeps
the user-interface excise to a minimum, compared to other
source-control systems; this is a sort of natural experience for how
small that excise really is, since I'm currently doing almost nothing
but dealing with darcs (and tar).

I've just recorded 36 changesets in 82 minutes, so the average
inter-changeset interval has been about 2.3 minutes, about 140
seconds.  This is on a project with around 1000 lines of code as of
the last changeset; the changesets I've currently committed represent
about seven nights of work over two weeks.

This 140-second excise means that darcs makes it practical to record
changesets for work units as small as half an hour.  It looks like
most of the changesets I'm currently recording represent about an hour
of work.

Some of those 140 seconds are consumed by navigating and extracting
the tar.gz snapshots, so darcs by itself is even more convenient.

Darcs rocks.

(P.S. some time after writing the above, I finished all of this
importation work, with a total of 80 changesets.  I'll push them out
soon.)

Sun, 22 Apr 2007

* A. Pagaltzis <pagaltzis at gmx.de> [2007-04-22 03:50]:
>     TMPDIR=`mktemp -d`
>     mkdir "$TMPDIR"

Whoops, `mktemp` creates the temporary directory itself, no
`mkdir` necessary.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Sat, 21 Apr 2007

* Kragen Javier Sitaker <kragen at pobox.com> [2007-04-21 09:40]:
> #!/bin/sh
> set -e
> usage="Usage: $0 checkpoint.tar.gz"
> : ${1?$usage}
> mkdir old
> trap 'rm -rf old' 0
> (cd old; tar xzf ../"$1")
> dir="$(ls old | head -1)"
> diff -urN old/"$dir" "$dir"

You can use mktemp(1) to make this robust. Also note that the $()
syntax for command substitution is a bash innovation, so the
/bin/sh shebang is wrong. Lastly, GNU and BSD tar have a handy -C
switch.

    #!/bin/bash
    set -e
    usage="Usage: $0 checkpoint.tar.gz"
    : ${1?$usage}
    TMPDIR=`mktemp -d`
    mkdir "$TMPDIR"
    trap 'rm -rf "$TMPDIR"' 0
    tar xzf "$1" -C "$TMPDIR"
    dir="$(ls "$TMPDIR" | head -1)"
    diff -urN "$TMPDIR/$dir" "$dir"

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

So I'm using Beatrice's Mac, which doesn't have darcs.  It does have
CVS, but after using darcs, CVS feels clumsy.  So I wrote this shell
script to diff a directory from a tarballed checkpoint of it.  Don't
use it if you have directories sitting around called 'old'.

#!/bin/sh
set -e
usage="Usage: $0 checkpoint.tar.gz"
: ${1?$usage}
mkdir old
trap 'rm -rf old' 0
(cd old; tar xzf ../"$1")
dir="$(ls old | head -1)"
diff -urN old/"$dir" "$dir"

Thu, 19 Apr 2007

I'm pleasantly surprised to see some promise in Bicicleta's current
concrete syntax.

A call to collect
-----------------

So consider this call to prog.collect:

    prog.collect(f: "(" + f.item + ")", for_each="cthulhu", 
        where=(f.item == "u").not)

This returns the list ["(c)", "(t)", "(h)", "(l)", "(h)"].

Having just explained this code to several people, I am now aware that
it is not a marvel of clarity, so I will start by explaining what this
does, and some of its internal workings.

It makes a list of "(" plus a character plus ")", for each character
of the string "cthulhu", as long as the character is not "u".

There are three arguments to 'prog.collect': 'arg1', which is "(" +
f.item + ")"; 'for_each', which is "cthulhu"; and 'where', which is
(f.item == "u").not.  'f' is a name used to refer to the collect
expression as a whole.

It's a little odd to have "arguments" whose value depends on the
function they're being passed to, so I should explain that they're not
really arguments, but methods.  Bicicleta doesn't really have
arguments.

This expression is evaluated by evaluating prog.collect (the results
of the 'collect' method on the variable 'prog', which conventionally
refers to the top level of the program), deriving a new object from it
by overriding the 'arg1', 'for_each', and 'where' methods described
above, and then calling the '()' method on the resulting object.  If
we didn't want to do this last step, we could write

    prog.collect {f: "(" + f.item + ")", for_each="cthulhu", 
        where=(f.item == "u").not }

which is just the object (the same object named by f).

Prog.collect evaluates 'arg1' and 'where' in a series of objects with
different 'item' methods, which return successive elements of the
'for_each' value, in order to construct the list.  'for_each' can be
any kind of sequence, not just a string.  

Mechanics of 'collect'
----------------------

Here's the full code for prog.collect:

    # Collect: map+filter, in a more listcompy shape.
    # WORDY! Uck!  Avoids prog.if because prog.if depends on collect.
    collect = {collect: arg1 = collect.item,
        cursor = collect.for_each.cursor
        item = collect.cursor.item
        where = prog.sys.bool.true
        next = collect { cursor = collect.cursor.advanced }
        '()' = collect.cursor.empty.if_true(
            then = collect.cursor
            else = collect.where.if_true(
                then = collect.arg1 @ collect.next()
                else = collect.next()))
    }

'arg1' defaults to collect.item (the same as f.item in the call
earlier); 'item' is defined as collect.cursor.item; 'cursor' defaults
to collect.for_each.cursor; 'where' defaults to true; 'next' is a
method that returns the same object, except with a new value for
'cursor' (giving it different values for 'item', 'arg1', 'result', and
maybe 'where'); and '()' either returns an empty list (if 'cursor' is
empty) or either collect.next() or collect.arg1 @ collect.next(),
depending on whether 'where' is true or false.  '@' is the cons or
list-construction operator.

So, in the call above, initially 'item' is "c", 'arg1' is overridden
to be "(c)", 'where' is overridden to be true, and the cursor is not
empty, so we end up with '()' returning "(c)" @ collect.next().  In
collect.next, the cursor is advanced to point to the next item, 'item'
is "t", arg1 evaluates to "(t)", and '()' evaluates to "(t)" @
collect.next(), so the top-level '()' evaluates to "(c)" @ ("(t)" @
some other stuff), and so it goes on.

In the case where item is "u", because cursor is pointing to the
beginning of "ulhu", 'where' evaluates to 'false', so the
collect.where.if_true expression returns collect.next(), ignoring the
"(u)" that 'arg1' would compute.

Eventually, the cursor is empty, and '()' just returns that (empty)
cursor, which serves to terminate the list; probably I should return
prog.sys.nil instead.  In those cases, it doesn't matter what 'item'
and 'arg1' evaluate to, even if they evaluate to errors, because
they're not being returned.  Likewise in cases where 'where' evaluates
to false --- '()' just returns collect.next() and bypasses 'result'
and 'arg1' entirely.

I anticipate that utilities like "collect" will be able to keep
explicit recursion confined to tiny corners of the system libraries
and to problems that really benefit from recursion.

Why I Think This is Cool (Bicicleta, Python, OCaml, and Squeak)
---------------------------------------------------------------

Loops are confusing and complicated, especially in functional
languages that implement them by recursion.  A lot of loops can be
subsumed by simple one-variable list-comprehensions, often with
improved comprehensibility and brevity.

For this reason, Python and Haskell have list-comprehension syntax
built into the language, so that you can write (in Python):

    ["(" + item + ")" for item in "cthulhu" if item != "u"]

Which gives you the same result as the Bicicleta expression:

    prog.collect(f: "(" + f.item + ")", for_each="cthulhu", 
        where=(f.item == "u").not)

(The .not is just because I haven't implemented != for strings yet,
because right now my !=-derived-from-== magic is locked up in a
numeric class from which I should factor out a "comparable".)

To my eyes, the Python version is more readable, but the difference is
not enormous; they are closer to one another than either is to

    rv = []
    for item in "cthulhu":
        if item != "u": rv.append("(" + item + ")")
    # now do something with rv

If I added special syntax to Bicicleta to do list-comprehensions, I
coule eliminate the "prog.collect" part:

    [f: "(" + f.item + ")", for_each="cthulhu", where=(f.item == "u").not]

But even without special syntax, I think it's better already than
Smalltalk:

    'cthulhu' asArray select: [:c | c ~= $u] 
        thenCollect: [:c | '(', c asString, ')']

Or OCaml:

    let list_of_string string = 
      let rv = ref [] in 
        for i = String.length string - 1 downto 0 do 
          rv := string.[i] :: !rv 
        done ; 
        !rv
    in
    List.map (fun item -> "(" ^ String.make 1 item ^ ")") 
      (List.filter ((<>) 'u')
        (list_of_string "cthulhu")) ;;

Although, to be fair, a lot of the verbosity in the Smalltalk and
OCaml versions has to do with excessive incompatible types (lists,
strings, arrays, characters) rather than the clumsiness of the
non-list-comprehension syntax.  But consider in the ideal case, where
those incompatibilities don't exist:

    ["(" + item + ")" for item in "cthulhu" if item != "u"]
    'cthulhu' select: [:c | c ~= $u] thenCollect: [:c | '(', c, ')']
    prog.collect(f: "(" + f.item + ")", for_each="cthulhu", 
        where=f.item != "u")
    List.map (fun item -> "(" ^ item ^ ")") 
      (List.filter ((<>) 'u') "cthulhu") ;;

With corresponding bits rearranged to more or less line up:

             "(" + item + ")" for item in "cthulhu"  if item != "u"
    thenCollect: [:c | '(', c, ')']       'cthulhu' select: [:c | c ~= $u] 
       "(" + f.item + ")",       for_each="cthulhu", where=f.item != "u"
   List.map (fun item -> "(" ^ item ^ ")")"cthulhu"  (List.filter ((<>) 'u')

This suggests that there is some brevity benefit that attaches
specifically to the practice of defining new methods in the form
f.item != "u", rather than creating anonymous functions, even such
lightweight functions as Squeak has [:c | c ~= $u].  Currying, such as
((<>) 'u') is even shorter, of course.

It turns out that you can use currying in Bicicleta similarly; you can
write "u".'!=' to mean {op: '()' = "u" != op.arg1}.  In this case,
though, collect is defined to expect a method definition on itself,
not an anonymous function that it would have to pass something to
explicitly.

Sun, 15 Apr 2007

> But in a modern programming language, Hanoi can be even shorter:

I wonder if there's any relation between the three towers being sufficient  
to transfer an arbitrary number of disks, and the ability of a 3-lisp* to  
reflect an arbitrary number of metacircular interpreters?

-Dave

:: :: ::

* (swiped from Oliver Danvy)
On the next planet, in a building, there was a computer room.

The little prince: "Good evening."
The programmer, glued to his computer terminal: "Sh! I'm programming!"
The little prince: "What is programming?"
The programmer, his eyes still on the screen: "Mmmmh ... you're  
interrupting me! What are you doing here?"
The little prince: "What is programming?"
The programmer, blinking and turning his head towards the little prince:  
"It is to make a machine do something.
You write a program specifying what to do and the machine runs it."
The little prince: "So what do you want the machine to do?"
The programmer: "To run my program. I am making a program to run my  
program, meta-circularly."
And the programmer returned to his meta-circular program.
The little prince pondered.
The little prince: "Why do you want a program to run itself since the  
machine already runs it?"
The programmer, looking at him again and frowning: "Ah. You're still  
there. It is because I am building a tower of
computation."
The little prince: "How tall will your tower be?"
The programmer: "Infinite."
The little prince looked up.
The programmer: "But in practice one only needs three levels."
The little prince looked down again.
The programmer, resuming his task with a concentrated appearance: "And I  
am working on it."
"What a strange person. Every time I ask him what he wants, he tells me  
what he is doing," the little prince said to
himself. "So he must always be doing what he wants," he thought continuing  
his journey: "a programmer must be
very happy."

Sat, 14 Apr 2007

I was reflecting on how the Towers of Hanoi problem was important in
my development as a programmer --- my Microsoft-Basic-limited idea of
"programming" was dramatically expanded when I saw a recursive
towers-of-Hanoi program in Pascal that was less than a screenful, and
of course really understanding recursion is a real mind-blower.

But in a modern programming language, Hanoi can be even shorter:

    (* move n discs from a to z (using third peg named x) *)
    let rec hanoi move n a z x = if n = 0 then () else 
      (hanoi move (n-1) a x z ; move n a z ; hanoi move (n-1) x z a)
    let printmove disc a b = print_endline 
	("Move disc " ^ (string_of_int disc) ^ " from " ^ a ^ " to " ^ b) ;;
    hanoi printmove 4 "peg A" "peg C" "peg B"

Fri, 13 Apr 2007

This stuff sounds interesting.  Please see also Virtual Ring Routing by Matthew
Caesar:

http://citeseer.ist.psu.edu/caesar06virtual.html

I'm not sure about the scalability/performance of VRR yet, but it appears to be
a simple hack which achieves decentralized global routability over arbitrary
underlay topologies.

The presentation typically talks about wireless link layer operation and all
sorts of bending over backwards to accomodate BGP tradition and so on, but the
theoretical essence is interesting in its own right.

Regards,

Zooko

> Great stuff, as usual. I wouldn't make all the initial IDs random.
> Reason: you can e.g. set the MAC from a WGS84 GPS position fix
> (about /m^2 Earth surface resolution, IIRC). Ideally, the ID
> should be a real 3d coordinate, of course.

(Eugen, has Kragen seen your grasscoding stuff yet?)

I've been handwaving geometrically*, but haven't yet had the chance to  
attempt coding any of the following.

a/ given a fixed mesh, maybe it'd be easier to just derive coordinates  
 from the edges.  let the minimal nodes in each dimension assign themselves  
0 coordinates, and propagate from there.

b/ on a random mesh, one could establish axes with initial random seeds:  
let the line between the global minimum and global maximum be the first  
axis, oriented appropriately, and choose the orientation of the second  
(perpendicular to the first) based on some comparison between the two  
sides.

now to where it gets really handwavy:

c/ choose random IDs so they should show up on the surface of a sphere (3d  
magnitude 1, say) -- then anneal by looking for violations of the triangle  
inequality.  does this converge to global quasi-gridlike coordinates, or  
does one wind up with a bunch of small gridlike grains that flip and  
reflip at the edges?

d/ a topological approach: establish 1D of addressing on the edge nodes,  
then propagate inwards to get something vaguely polar.  not so good if  
there are interior holes, nor if one wishes to extend to 3d.

-Dave

:: :: ::

* directionality is supposed to be difficult with radio, but I'm assuming  
that a small phased array would allow both a rough estimate of metric  
relation and improved performance during data transfer -- this assumption  
is pretty lousy for mobile applications.

> There's more to the ID encoding, you have to minimize
> the amount of bits flipping for nodes in circular orbits
> (anything moving on Earth surface also applies).

which explains some (the non-temporal aspects, anyway) of your interest in  
Gray coding...

Wed, 11 Apr 2007

> I'm trying to figure out what kind of data structure I should use for
> ordinary ordered collections in Bicicleta: some kind of numerically-
> indexed vector, or a Lisp-style externally-singly-linked list?

How about immutable cords/ropes as in E? They support both reasonably
efficient indexing and structure sharing like cons pairs. I've had it
in mind for quite a while to follow this approach in my next 'real'
language, although so far none has been real enough to bother.

Darius

Mon, 09 Apr 2007

> random_shuffle looks pretty hairy.

random_shuffle, as usually implemented, is based on swaps (establishing a  
random permutation) -- but one can also think of it as sorting with a  
"broken" comparison function which returns random results, so there may be  
a clever alternative to doing a real sort with an appended random key?

> There's a non-clever way to make any sort stable, which is to add a
> secondary sort key that is the original position in the list.  There
> may be a clever way to do a stable mergesort without knowing the list
> lengths in advance, but I don't know it.

The idea of radix-sort is that one can reduce the problem of sorting on a  
wide key to several sorts on narrow keys -- adding a secondary sort key is  
a bit self-defeating for this use.

The brute-force way to preserve stability is to reverse the reversed list  
in between passes, as used here with integers providing lists (and a pair  
of integers providing a tape):
http://en.literateprograms.org/Merge_sort_(dc)

The clever way (as used with actual tape drives) was to reverse the sense  
of comparison on each pass.

> That leaves only the heap and permutation operations.  I really don't
> know about them.

I'd guess heaps and permutations are essentially swap-based, and that  
swaps imply random access.
(permutations are closely related to insertion/selection-sort, which are  
not so far away from heap sort)

Knuth Vol 3, "Sorting and Searching", might be a good reference.  Anything  
that was good for external sorting with tapes ought to be good with  
lists.  (indeed, being good with tapes implies that one should be able to  
find more cache-friendly implementations than the standard  
singly-linked-list)

> A remarkably large share of the uses of array indexing by numerical
> variables I found in both my Python code and my JavaScript code were
> for some variant of the string.join problem --- given an array of N
> strings, connect them with N-1 commas, except when N=0, in which case
> we should connect them with N=0 commas instead of N-1 = -1 commas.

Squint at it properly, and tail-call optimization becomes an instance of  
string.join: instead of "a,b,c," (where x is a jump and ',' the equivalent  
return) we want to generate "a,b,c".  The dual (context preservation when  
building argument lists) follows the same pattern: we only need to  
save/restore contexts if there are at least two overlapping evaluations.

This may be a practical reason for the last couple of decades' interest in  
monads: we prefer to program in an algebraic, tree-like form (code as well  
as data), while the iron prefers to handle linear sequences of code  
operating on contiguous chunks of data (op x state -> state') -- and  
monads provide machinery to go from tree inputs to flat results in a  
theoretically nice manner.

-Dave

Introduction
------------

I'm trying to figure out what kind of data structure I should use for
ordinary ordered collections in Bicicleta: some kind of numerically-
indexed vector, or a Lisp-style externally-singly-linked list?  I feel
that I have to decide this now, because I need to add variadic
functions and introspective slot/property/method listing.

Weighing on the side of numerically-indexed vectors, there are more
efficient implementations possible (I think), they work reasonably
well in Python, JavaScript, and Perl, and they don't require people to
think recursively.

Weighing on the side of Lisp-style lists, they don't require an
additional primitive type, they don't introduce numbers (as indices)
into parts of your program that have nothing to do with arithmetic,
and they might not require people to think recursively.

Both structures are capable of the same set of operations; it's just a
matter of which ones are expensive and which ones are dangerous.

So I thought I'd look to see if one or the other would really cramp
anybody's style.

(In here, where I say "array", it means "vector", and where I say
"list", I might mean either "linked list" or "vector".)

Contents
--------

Introduction
Contents
STL Algorithms
Common Lisp Sequences Dictionary
My Own Programming Style in Python
My Own Programming Style in JavaScript
Efficiency
The Strange Case of string.join
Conclusions

STL Algorithms
--------------

So here's a survey of the 1994/1995 STL algorithms and the iterator
types they use.  The STL contains a comprehensive collection of
generally-useful algorithms, and carefully distinguishes what kind of
data structure traversal each algorithm needs.  Immutable Lisp-style
lists can support "input" and "forward" traversal, but not
"bidirectional" or "randomaccess", and by virtue of being immutable,
they don't support "output" directly --- but generally "output"
iterators can map to a returned list value.

    for_each input
    find input
    find_if input
    adjacent_find forward
    count input
    count_if input
    mismatch input
    equal input
    search forward
    copy input output
    copy_backward bidirectional
    iter_swap forward
    swap_ranges forward
    transform input output
    replace forward
    replace_if forward
    replace_copy input output
    replace_copy_if input output
    fill forward
    fill_n output
    generate forward
    generate_n output
    remove forward
    remove_if forward
    remove_copy input output
    remove_copy_if input output
    unique forward
    unique_copy input output
    reverse bidirectional
    reverse_copy bidirectional output
    rotate forward
    rotate_copy forward output
    random_shuffle randomaccess
    partition bidirectional
    stable_partition bidirectional
    sort randomaccess
    stable_sort randomaccess
    partial_sort randomaccess
    partial_sort_copy input randomaccess
    nth_element randomaccess
    lower_bound forward
    upper_bound forward
    equal_range forward
    binary_search forward
    merge input output
    inplace_merge bidirectional
    includes input
    set_union input output
    set_intersection input output
    set_difference input output
    set_symmetric_difference input output
    push_heap randomaccess
    pop_heap randomaccess
    make_heap randomaccess
    sort_heap randomaccess
    max_element forward
    min_element forward
    lexicographical_compare input
    next_permutation bidirectional
    prev_permutation bidirectional
    accumulate input
    inner_product input
    partial_sum input output
    adjacent_difference input output

So only the following 18 algorithms out of the 64 require
bidirectional or random-access iterators:

    copy_backward reverse reverse_copy random_shuffle partition
    stable_partition sort stable_sort partial_sort partial_sort_copy
    nth_element inplace_merge push_heap pop_heap make_heap sort_heap
    next_permutation prev_permutation

The other 46 algorithms work fine without.  For most of the above 18,
there are also well-known ways to get the desired effect with
side-effect-free data structures.

copy_backward sounds like reverse_copy, but it makes a forward copy
--- it just does it in reverse order so that it works when copying
stuff upwards to an overlapping range.

reverse is very straightforward to perform with linked lists.  I think
it is only done with a bidirectional iterator in STL because there's
no "backwards output iterator", which may be because STL's lists are
doubly-linked, so they have bidirectional iterators.

random_shuffle looks pretty hairy.  Probably best just to assign a
random number to each element and sort them, although that feels like
cheating.

partition (and stable_partition) is straightforward to do with lists,
just less efficient.

As for sort and stable_sort, quicksort and mergesort can easily be
done on lists.  The clever thing about mergesort is that it's normally
a stable sort, which is to say that it doesn't reorder elements that
compare equal, but running mergesort using "forward iterators"
suggests that you should use a different way of dividing up the list
--- say, unshuffling it --- but that doesn't give you a stable sort.
There's a non-clever way to make any sort stable, which is to add a
secondary sort key that is the original position in the list.  There
may be a clever way to do a stable mergesort without knowing the list
lengths in advance, but I don't know it.

partial_sort selects the smallest N elements and sorts them.  A lazy
mergesort should provide this nicely.

nth_element is quickselect, which I think can also be done with lists
without too much difficulty.

inplace_merge can be done with merge; indeed, in STL, merge is also
present.

That leaves only the heap and permutation operations.  I really don't
know about them.

This discussion leaves out the fact that several of the algorithms
(upper_bound, lower_bound, equal_range, binary_search, includes) have
implementations that run faster when they have random-access iterators
available.

Common Lisp Sequences Dictionary
--------------------------------

Common Lisp has a similar, though slightly smaller, set of algorithms
that can work on any kind of sequence, including lists and vectors;
it's known as the Sequences Dictionary.  Here's the list:

    (copy-seq sequence)
    (elt sequence index) setfable
    (fill sequence item &key start end)
    (make-sequence result-type size &key initial-element)
    (subseq sequence start &optional end) setfable
    (map result-type function &rest sequences+)
    (map-into result-sequence function &rest sequences+)
    (reduce function sequence &key key from-end start end initial-value)
    (count item sequence &key from-end start end key test test-not)
    (count-if predicate sequence &key from-end start end key)
    (count-if-not predicate sequence &key from-end start end key)
    (length sequence)
    (reverse sequence)
    (nreverse sequence)
    (sort sequence predicate &key key)
    (stable-sort sequence predicate &key key)
    (find item sequence &key from-end test test-not start end key)
    (find-if predicate sequence &key from-end start end key)
    (find-if-not predicate sequence &key from-end start end key)
    (position item sequence &key from-end test test-not start end key)
    (position-if predicate sequence &key from-end start end key)
    (position-if-not predicate sequence &key from-end start end key)
    (search sequence-1 sequence-2 &key from-end test test-not key 
            start1 start2 end1 end2)
    (mismatch sequence-1 sequence-2 &key from-end test test-not key
            start1 start2 end1 end2)
    (replace sequence-1 sequence-2 &key start1 end1 start2 end2)
    (substitute newitem olditem sequence &key from-end test test-not
            start end count key
    (substitute-if newitem predicate sequence &key from-end 
            start end count key)
    (substitute-if-not newitem predicate sequence &key from-end 
            start end count key)
    (nsubstitute newitem olditem sequence &key from-end test test-not 
            start end count key)
    (nsubstitute-if newitem predicate sequence &key from-end 
            start end count key)
    (nsubstitute-if-not newitem predicate sequence &key from-end 
            start end count key)
    (concatenate result-type &rest sequences)
    (merge result-type sequence-1 sequence-2 predicate &key key)
    (remove item sequence &key from-end test test-not start end count key)
    (remove-if test sequence &key from-end start end count key)
    (remove-if-not test sequence &key from-end start end count key)
    (delete item sequence &key from-end test test-not start end count key)
    (delete-if test sequence &key from-end start end count key)
    (delete-if-not test sequence &key from-end start end count key)
    (remove-duplicates sequence &key from-end test test-not start end key)
    (delete-duplicates sequence &key from-end test test-not start end key)

There's also the Conses Dictionary, which is a much larger collection
of algorithms that apply specifically to conses, mostly singly-linked
lists made of them; many of these are widely-applicable algorithms as
well, like set-difference and mapcan, that just happen to work only on
lists.  I haven't surveyed them because there are so many of them,
although perhaps I should survey them if my objective is to see
whether picking vector as a default ordered collection would cause
problems.

There's an Arrays Dictionary too, but it is much more minimal,
containing things like array-dimensions and simple-vector-p.

Some notes on these, because they are different in some particulars
from the corresponding algorithms in other languages:
- unlike in C++ or Haskell, there is no lazy sequence type in Common
  Lisp.
- in the absence of initial-value, reduce calls its function with zero
  arguments when the sequence is empty.
- map maps in parallel over the sequences, which could express certain
  kinds of vector math more conveniently.  It stops when any of the
  input sequences runs out.
- map-into stops when it reaches the end of its output sequence or any
  of the input sequences.
- setfing subseq only works if the new subsequence is of the right
  length, unlike in Python.
- as in STL, sequence indices count from zero, and ranges designated
  by start-end pairs include the start but not the end.
- many of the functions include a :key argument which extracts the key
  to be compared, tested, etc.  Usually, this is equivalent to
  composing some functional argument with the key, potentially in some
  hairy way (e.g. for sort or merge, #'< becomes 
  #'(lambda (x y) (< (key x) (key y)))) but can be implemented more
  efficiently.

My Own Programming Style in Python
----------------------------------

In Python, Perl, and JavaScript, the "lists" are numerically-indexed
arrays, so if there's anywhere where I'd be prone to lots of numerical
indexing of lists, it should be in these languages.  So here's the
Python I have lying around:

    bicicleta/objcalc.py
    bicicleta/objcalcpp.py
    fib.py
    laptoptable.py
    readbmp.py
    safe_repr.py
    ~/cursmail/cursmail.py
    ~/bloompy/bloom.py
    ~/textindex/textindex.py

One by one:

* bicicleta/objcalc.py, bicicleta/objcalcpp.py

These two are the object-calculus interpreter I posted to kragen-hacks
a couple of years ago.  It's extremely non-numerical.  It accesses
lists with zip, for x in y:, list comprehensions, list appending with
+, string.join, and construction of lists with "list displays" (the
Python name for listing a bunch of expressions inside square brackets,
separated by commas, to make a list of their return values); and
there's one place where it says:

        if not lines[-1]: lines = lines[:-1]

It also looks at strings a bit --- it cares whether their length is 0
and whether their char 0 is in "self.namechars".

With these exceptions --- testing for emptiness, looking at the first
item, looking at the last item, removing the last item --- it gains
nothing from the array nature of Python lists.

* fib.py

Contains no lists, because it's a deliberately stupid implementation
of the fibonacci sequence and is only four lines long.

* laptoptable.py

This program parses a text file full of RFC-822 headers and turns it
into a web page with a table.  It tests the first character of a line
with line[0], strips the first character of a key with k[1:], and
tests len(sys.argv) and extracts the first item with sys.argv[1];
otherwise it accesses lists and strings with string.strip,
string.replace, string.join, map, list comprehensions, string.split,
list displays (sort of).

There's also a fair bit of JavaScript in this file.

* readbmp.py

This program is fairly numerical.  It parses a 24-bit-color BMP file
into an array of arrays of floats, generates another array of random
floats, adds them together element by element, and turns the result
into PostScript.

Accordingly, it has stuff like this:
    (width, height) = struct.unpack('<ll', header[18:26])
    (depth,) = struct.unpack('<h', header[28:30])
    return [[(0.5 - image[ii][jj] * 0.5 + pad[ii][jj]) % 1 
             for jj in range(len(image[0]))]
            for ii in range(len(image))]

But it still mostly constructs the arrays with list.append, which adds
a single value to the end of a list, even though it could know the
size of the array before it starts building it; and it still mostly
iterates over the arrays with "for x in y:" and list comprehensions.
The ii jj code above probably should have been written with map,
which, like Common Lisp's map, can map over two sequences at once:

    pad_pixel = lambda pix, padpix: (0.5 - pix * 0.5 + padpix) % 1
    pad_row = lambda imgrow, padrow: map(pad_pixel, imgrow, padrow)
    return map(pad_row, image, pad)

That would leave the only numerical indexing in the program up in the
BMP-header-parsing code, which really *should* be using numerical
indexing, since the "spec" I was using, Greg Roelofs' /etc/magic entry
for BMP files, was written in terms of numerical byte offsets.

* safe_repr.py

This program generates debugging representations of possibly infinite
Python data structures in several different ways --- but without using
too much time or returning too much data.  So it accesses arrays and
strings with %, list comprehensions, string.join, and the like, and
occasionally list.append and list.pop, but it does frequently truncate
strings to a specific numerical length --- that's in the spec!  This
is in statements like this:

	astr = astr[:self.maxlen - self.curlen]

Also, a couple of parts of it do a lazy version of string.join by
iterating over the indices of a sequence, conditionally appending
commas in between; and there's another part where it adapts a
(functional map over a) native Python sequence to an ad-hoc lazy
iteration protocol made out of tuples and thunks:

    def _safe_repr_5_map(xform, val, ii):
        if ii == len(val): return empty_stream
        else:
            return (xform(val[ii]), lambda: _safe_repr_5_map(xform, val, ii+1))

Those parts would be slightly simpler if they were using an iterator
protocol that allowed them to see when the sequence was almost
exhausted, rather than using integers to iterate over the sequences.

* ~/cursmail/cursmail.py

This is an outdated copy of my mailreader.  It uses numbers to index
lists a lot, in the following cases:

    if ii < len(self.lines): return self.lines[ii]
    self.mbox.seek(self.msgs[-1])
    self.mbox.seek(self.msgs[index])
    return self.mbox.fp, self.msgs[index]
    msgid = fields[0]
    self.set_tags(msgid, fields[1:])
    while isinstance(payload, type([])): payload = payload[0].get_payload()
    front, astr = astr[:width], astr[width:]
    else: return astr[:width]
    msg = msglist[msgnum]
    msgid = message_id(mboxlist[message_index])
    if message_matches(search_terms, mboxlist[newmi]):
    no_terms = [term[1:] for term in terms if term.startswith('-')]
    mbox = file(argv[1])

A lot of this is because I wanted to save memory, so I stored a small
amount of data about a large number of messages in an array, I mean a
list.  Some of the other instances happened because the email.Message
API is gross, because I was doing stuff in curses and therefore cared
about how many characters wide things were (although UTF-8 plays havoc
with that anyway), and the usual dropping-the-first, getting-the-rest,
and getting-the-last kind of stuff.

This program is different from the others because it deals with more
external interfaces (this version imports 9 modules) and because I was
concerned with memory usage.  I should probably switch away from the
integer-indexed list of messages, though, and create a mailbox-pointer
class instead.

* ~/bloompy/bloom.py

This program uses arrays, in a big way!

	    self.filter[bucket] |= (1L << bit)
	    if not self.filter[bucket] & (1L << bit): return 0

There are only those two lines, but they are kind of the core of the
program.  It also does a little light string processing:

	for suffix in ['s', 'ing', 'ed', 'es', 'er', 's', 'ly']:
	    if word.endswith(suffix): yield word[:-len(suffix)]

* ~/textindex/textindex.py

This program is a little larger than the others at 1800 lines, and it
implements an on-disk full-text search engine supporting a couple of
different file types and using an innovative data structure.  It
imports 12 modules and implements generalized Bloom filters, binary
serialization, some minimal data compression, and a bunch of other
stuff.  So I haven't read it as thoroughly as the others.

It keeps an array of document hashes that gets indexed numerically in
order to allow the rest of the code to use small integers to represent
indexed documents.

It uses numerical indexing of strings for some minimal parsing:
    self._headwords_fname = firstline[len(magic):-1]
    rv = tuple(map(urllib.unquote, line[:-1].split(' ')))

Also it uses the old numerical interface to os.stat because I didn't
know any better at the time:
    return os.stat(fname)[stat.ST_SIZE]

Some of the tests use numeric slicing to ignore certain elements of a
tuple:
    ok(reader.find('0032')[0:2], find_rv)

It uses numeric slicing to manage byte buffers that are windows onto a
file:
    start += end
    buf = buf[end:]
    self.buf = self.buf[pos:]

It uses numeric slicing to chop up a SHA-1 hash, represented as a
binary byte string, into smaller hashes:
    return (gethash(bighash[0:4])[0], gethash(bighash[4:8])[0],
            gethash(bighash[8:12])[0])

Of course the generalized Bloom filter uses numerical indexing as
well:
    self.additions[bucketnum] = bucket = [doc_id]

* overall

Perhaps unsurprisingly, I seem to try to avoid numerical indices for
Python lists and strings, and the majority of my uses fall into one of
these easy-to-fix categories:
- extracting the first or second item
- extracting the last item
- dropping the first or last item
- keeping track of the state of a simple forward iteration over the list
- interfacing with some Python module that returns a list with special
  significance attached to each element (sys.argv, os.stat)

However, there are some cases demonstrated here that are slightly
harder to fix:
- parsing binary data structures specified in terms of byte offsets
- truncating strings to a specified length
- implementing a Bloom filter
- managing byte buffers

There's this additional concern: iterating over arrays with numbers is
pretty easy to understand, and it's pretty easy to figure out how to
do stuff.  For example, I wrote the parallel iteration over the lists
of lists by using integer indices because I had forgotten that
Python's "map" could iterate over multiple lists in parallel; it was
good that I could just do it instead of having to look through the
manual for some way to iterate over multiple lists in parallel.

Maybe iterator objects can be just as nice that way, but I'll have to
see it to be sure.

My Own Programming Style in JavaScript
--------------------------------------

I have some bits of JavaScript lying around, too.

    laptoptable.py
    sk.js
    formulasheet/formulasheet.js
    formulasheet/freevars.js
    (didn't bother to analyze the following; they total 871 lines)
    js-calc.js
    slow3d/torus.js

* laptoptable.py

This is nominally a Python program, but most of its contents are a big
JavaScript string.

The first thing the JavaScript contains is a bunch of very short
functions like this:

    function forEach(list, fun) {
        for (var ii = 0; ii < list.length; ii++) fun(list[ii])
    }
    // Is item an element of list?
    function element(item, list) {
        if (list.length == null) throw "attempt to find element in non-list"
        for (var ii = 0; ii < list.length; ii++)
            if (item == list[ii]) return true
        return false
    }

Those are there to avoid numerical indexing in the rest of the code.

There's a place where I drop the first item in an array:

    var name = args.splice(0, 1)

JavaScript regular expression matches return arrays:

    var rv = parseFloat(match[1])

And there are other places where I get the first item in an array, or
drop it:
    this.title = this.cells()[0].firstChild.textContent
    rv.splice(0, 1)

Here's a loop where I want to loop over both indices and their
corresponding values, which is stupid in retrospect:
    for (var ii = 0; ii < data_values.length; ii++) {
        values.push([ii, data_values[ii]])
    }

As you can guess, that array being constructed there gets indexed with
integer constants.  I probably should have written {i: ii, d:
data_values[ii]}.  Except that the whole thing is stupid.  But I do
probably need to be indexing data_values with a column-number
variable.

There's another loop later where I loop over the indices of an array
in order to insert its contents into the document with commas in
between.

And another loop where I return the numerical position in an array of
an object with a certain key --- like (position ... :key ...) --- in
order to be able to delete it from the list later.  This actually
happens in two places.

* sk.js

This is a graph-reduction evaluator for SK-combinators, and as such we
would expect it to be entirely non-numerical in nature.

It depends on MochiKit, so it really has no excuse for indexing arrays
numerically.  But it does.

It has:
- a loop that iterates over two arrays in parallel (by indices)
- a loop that does, effectively, join(' '), but could be written with
  a forEach because it starts with an initial value from outside the array
- a forEach that uses an Array as a stack with shift and unshift, and
  therefore uses stack[0] to get at the top-of-stack;
- a function 'last' that returns list[list.length - 1]
- various accesses to expression arguments by position rather than by
  name.

* formulasheet/formulasheet.js

This is mostly a simple user interface wrapped around eval, with some
funky stuff to simulate reflection on free variables.  It doesn't
depend on MochiKit and has its own forEach function, but also uses
explicit numerical indices in several places for other kinds of
iteration: find (sometimes with a :key), :from-end, skipping the first
element, dropping the first and last characters of a string.

* formulasheet/freevars.js

This is actually the funky stuff to simulate reflection.  It indexes
arrays by number because JavaScript's regexp matches are indexed by
number, but that's it.

* overall

I use numerical indices a lot in JavaScript, probably even when I
should be using MochiKit's iteration system instead.  (But MochiKit's
iteration system is sloow.)  Mostly I use them for kinds of iteration
that are a little more complicated than simple forEach, but I also use
them for getting the last item, interfacing with regular expressions,
deleting items from arrays, and in place of Objects with named
properties.

Efficiency
----------

The big attraction of arrays is that most operations on them are
constant-time and damned fast: accessing an element, moving to the
next or previous element, getting the length of the array.  Even
appending a new element is amortized constant time, and again, damned
fast.

However, singly-linked lists have one potentially huge advantage: they
support structure sharing.  They probably ought to explain this in CS
101 when they explain singly-linked lists, but generally they don't,
and the students are left thinking that singly-linked lists are just a
sort of extreme optimization for cases where you don't want to pay the
extra pointer field and update cost for a doubly-linked list.

But, for example, with singly-linked lists, you can maintain a large
stack throughout a computation --- say, in a shift-reduce parser ---
and make a list of all of the stack states throughout the parsing
process, in a relatively reasonable amount of space.  And, in
particular, you can build up a list element by element without
affecting the views of old versions of the list, and you can likewise
iterate over the suffixes of the list efficiently.

And iterating over a singly-linked list is almost as fast as iterating
over an array, modulo locality of reference.

But all of that is only true as long as you aren't mutating the list.
Once you start mutating the list, all of these nice advantages of
structure sharing disappear.

This drawback may not be very relevant to Bicicleta, where I don't
expect mutation in general to be very common.

The Strange Case of string.join
-------------------------------

A remarkably large share of the uses of array indexing by numerical
variables I found in both my Python code and my JavaScript code were
for some variant of the string.join problem --- given an array of N
strings, connect them with N-1 commas, except when N=0, in which case
we should connect them with N=0 commas instead of N-1 = -1 commas.

In Knuth's article "Structured Programming With Go To Statements", as
I remember it, this kind of loop is called the "N and a half times
loop".  I seem to remember that he somewhere suggested a "loop
... while condition ... repeat" construct which allows you to do the
termination test anywhere in the body of the loop, to make this kind
of loop straightforward to write, and I think TeX ended up with that
as its normal while loop construct.

The code I wrote was generally something along these lines:

    for ii in range(len(items)):
	for item in xform(items[ii]): yield item
	if ii < len(items) - 1: yield ', '

With a Python-syntax version of Knuth's construct, we could remove the
-1 and the redundant test:

    ii = 0
    loop:
        for item in xform(items[ii]): yield item
        ii += 1
    aslongas ii < len(items):
        yield ', '

Using Python's iterator protocol, this can be rewritten in an
integer-free form, although it's slightly longer:

    i = iter(items)
    try: yield i.next()
    except StopIteration: return
    for item in i:
        yield ', '
        yield item

A more powerful iterator protocol would allow a simpler version, very
similar to the index-based version:

    i = iter(items)
    while i.nonempty:
        yield i.current_item
        i.next()
        if i.nonempty: yield ', '

Conclusions
-----------

Even when your collections are vectors, it's probably better to
iterate over those vectors with some kind of iterator object rather
than directly with indices --- the code is generally clearer, and as a
bonus, it can work with non-vector collections.  The cases where
integer indexing is really needed are less-common things like
implementing hash tables, efficient sorting and binary search, heaps,
permutations, 3-D graphics, and binary file format parsing.

However, these less-common things are actually really important, so
the language can't get away without a primitive vector type for a very
long time.

But if most things will be accessing vectors and other collections
using an iterator protocol, then variadic function argument lists and
introspective slot listing don't need to be vectors; they can just be
objects that implement the iterator protocol.  So I can put off the
hairy user-interface-design question of how to deal with simple
ordered collections in a language without mutation for a while longer.

Sat, 07 Apr 2007

> So I thought I'd write a quick little language implementation in
> OCaml, since (a) OCaml is sort of optimized for writing languages and
> (b) I wanted to learn OCaml, including ocamllex and ocamlyacc.

(b) is a big reason for doing things the way you did, but I thought I'd  
point out that, as with LISP or FORTH or dc(1), in APL lexing and parsing  
are nearly trivial* -- possibly due to deliberate design decision,  
possibly to due to the nature of machines (and lack of language theory!)  
at the time.  It might be interesting to see what changes were made to apl  
when it went from being a human-centered didactic notation to a  
machine-implementated executable one.

-Dave

:: :: ::

* subscripting may have been an exception, hence its absence in J?

For instance, if one carries the environment in machine state,

   x*x:a+b

can be scanned a character at a time and directly translated to {load b;  
add a; store x; mul x}, while more modern equivalents,

   x*x where x=a+b
   let x=a+b in x*x

involve some minor parsing first.

So I thought I'd write a quick little language implementation in
OCaml, since (a) OCaml is sort of optimized for writing languages and
(b) I wanted to learn OCaml, including ocamllex and ocamlyacc.  So
here's a tiny, nearly useless subset of APL, supporting only
one-dimensional vectors of floating-point numbers.

Like everything else I post to kragen-hacks without a notice to the
contrary, this code is in the public domain; I relinquish any
copyright.

Here's a sample session:

Beauty:~/devel/toyapl kragen$ ./toyapl_repl
Welcome to toyapl, a tiny APL subset.
Separate numbers by spaces; available ops are + - * / +/ */ % iota
% is modulo; / is division.
This program is not expected to be useful; I wrote it to learn OCaml.
        3150 10000 12000 15000 / 2150
1.46511627907 4.6511627907 5.58139534884 6.97674418605
        (+/ 3150 10000 12000 15000 / 2150) / 4
4.66860465116
        iota 15
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
        ((iota 10)/10)*((iota 10)/10)
0 0.01 0.04 0.09 0.16 0.25 0.36 0.49 0.64 0.81
        ((1+iota 10)/10)*((1+iota 10)/10)
0.01 0.04 0.09 0.16 0.25 0.36 0.49 0.64 0.81 1
        (iota 15) % 3
0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
        ((iota 20)*1+iota 20)/2
0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136 153 171 190

When I get a chance, I'm putting this online at
http://pobox.com/~kragen/sw/toyapl.html and
http://pobox.com/~kragen/sw/toyapl.tar.gz




(* First, here's an expression type, which I put in toyapl_expr.ml *)
type value = float list ;;
type op = string ;;
type expr = 
    Unary of op * expr 
  | Atom of value 
  | Parenthesized of expr 
  | Binary of expr * op * expr ;;  (* later, add variables *)




(* Then, here's toyapl_lexer.mll: *)
{
open Toyapl_parser ;;
open Toyapl_expr ;;
}
let alus = ['A'-'Z' 'a'-'z' '_']
let alnumus = alus | ['0'-'9']
let ident = alus alnumus*
rule next = parse
     [' ' '\t']  { next lexbuf }  (* skip whitespace; stolen from manual *)
   | ['0'-'9']+ ('.' ['0'-'9']+)? 
                           { Num (float_of_string (Lexing.lexeme lexbuf)) }
   | ['+' '-' '/' '*' '%']+ | ident 
                           { Op (Lexing.lexeme lexbuf) }
   | '('                   { Lparen }
   | ')'                   { Rparen }
   | '\n'                  { Eol }




(* Then, here's toyapl_parser.mly: *)
%{
open Toyapl_expr;;
%}
%token <float> Num
%token <Toyapl_expr.op> Op
%token Lparen Rparen Eol
%right Op
%start parse_line
%type <Toyapl_expr.expr> parse_line
%%
parse_line:
        expr Eol            { $1 }
;
atom:
        Num                 { [ $1 ] }
      | atom Num            { $1 @ [ $2 ] }
expr:
        atom                { Atom $1 }
      | expr Op expr        { Binary($1, $2, $3) }
      | Lparen expr Rparen  { Parenthesized($2) }
      | Op expr             { Unary($1, $2) }
      | Lparen Rparen       { Atom([]) }




(* And here's a top-level driver for the main program, which I put in 
   toyapl_repl.ml: *)
print_string "Welcome to toyapl, a tiny APL subset.
Separate numbers by spaces; available ops are + - * / +/ */ % iota
% is modulo; / is division.
This program is not expected to be useful; I wrote it to learn OCaml.
" ;
try Toyapl.repl stdin stdout with End_of_file -> () ;;




(* And here's the main program, toyapl.ml: *)
(* toy APL as an exercise to learn OCaml *)
(* to do: change values to trees *)
open Toyapl_expr ;;

exception Op_not_found of op ;;

let assoc f list = try List.assoc f list 
    with Not_found -> raise (Op_not_found f) ;;

let rec eval_with unaries binaries = let rec eval = function 
    Unary (f, e) -> (assoc f unaries) (eval e)
  | Atom e -> e 
  | Parenthesized e -> eval e
  | Binary (e1, f, e2) -> (assoc f binaries) (eval e1) (eval e2)
  in eval ;;

let show_num n = if n = float_of_int (int_of_float n) 
    then string_of_int (int_of_float n)
    else string_of_float n ;;

let rec show_atom = function
    [] -> "()"
  | n :: m :: lst -> show_num n ^ " " ^ show_atom (m::lst)
  | [n] -> show_num n ;;

let rec show = function
    Unary (f, e) -> f ^ " " ^ show e
  | Atom e -> show_atom e
  | Parenthesized e -> "(" ^ show e ^ ")" 
  | Binary (Atom _ as e1, f, e2) | Binary (Parenthesized _ as e1, f, e2) -> 
      show e1 ^ " " ^ f ^ " " ^ show e2
  | Binary (e, f, e2) -> 
        show (Parenthesized e) ^ " " ^ f ^ " " ^ show e2 ;;

(* numbers up to n. *)
(* it's not clear what iota should do when applied to a nonscalar *)
let apl_iota [n] = let rec iota start stop = if start = stop then [] 
        else float_of_int start :: iota (start + 1) stop
    in iota 0 (int_of_float n) ;;

let unary_lift = List.map ;;
let reduce op id x = [List.fold_left op id x] ;;
let unaries = ["+", unary_lift (fun x -> x); "-", unary_lift (~-.);
                     "+/", reduce (+.) 0.;
                     "*/", reduce ( *. ) 1.;
                     "iota", apl_iota] ;;

exception Mismatched_list_lengths of value * value ;;

let aplbinarylift f a b =
    let flip f a b = f b a
    in
    match (a, b) with
        ([a1], [b1]) -> [f a1 b1]
      | ([a1], b1 :: b2 :: rest) -> List.map (f a1) b
      | (a1 :: a2 :: rest, [b1]) -> List.map (flip f b1) a
      | (_, _) -> try List.map2 f a b with 
            Invalid_argument _ -> raise (Mismatched_list_lengths (a, b)) ;;

let fmod a b = a -. b *. (floor (a /. b)) ;;

let binaries = ["+", aplbinarylift (+.); "-", aplbinarylift (-.);
                "*", aplbinarylift ( *. ); "/", aplbinarylift (/.);
                "%", aplbinarylift fmod] ;;

let eval = eval_with unaries binaries ;;

let show_value = show_atom;;

let parse value = Toyapl_parser.parse_line Toyapl_lexer.next 
    (Lexing.from_string (value ^ "\n")) ;;

let repl inp outp =
    while true do
        output_string outp "        "; flush outp;
        let inval = input_line inp in
        try output_string outp ((show_value (eval (parse inval))) ^ "\n");
            flush outp
        with Parsing.Parse_error -> output_string outp "?parse error\n"
           | Op_not_found (op) -> output_string outp ("?not found " ^ op ^ "\n")
           | Stack_overflow -> output_string outp "?stack overflow\n"
           | Mismatched_list_lengths(a, b) -> 
               output_string outp ("?length mismatch: " ^ show_value a ^ ", " ^ 
                                   show_value b ^ "\n")
    done ;;

(* exception Test_failure of int * 'a * 'a ;; *)

let test () = 
    (* atoms *)
    assert ([45.] = eval (Atom [45.]));
    assert ("45" = show (Atom [45.]));
    assert ([45.; 50.] = eval (Atom [45.; 50.]));
    assert ("45 50" = show (Atom [45.; 50.]));
    (* parenthesized expressions *)
    let f7 = Parenthesized (Parenthesized (Atom [47.])) in
    assert ("((47))" = show f7);
    assert ([47.] = eval f7);
    (* binary ops (depends on addition) *)
    let seven = Binary ((Atom [3.]), "+", (Atom [4.])) in
    assert ("3 + 4" = show seven);
    assert ([7.] = eval seven);
    assert ([7.] = eval (Parenthesized seven));
    assert ("(3 + 4)" = show (Parenthesized seven));
    (* left operands of binary ops: binary ops *)
    let twelve = Binary (seven, "+", Atom [5.]) in
    assert ([12.] = eval twelve);
    assert ("(3 + 4) + 5" = show twelve);
    assert ("(3 + 4) + 5" = show (Binary (Parenthesized seven, 
						"+", Atom [5.])));
    (* left operands of binary ops: atoms *)
    assert ("5 + 3 + 4" = show (Binary (Atom [5.], "+", seven)));
    (* unary ops (depends on negation) *)
    let minus_one = Binary (Atom [3.], "+", Unary("-",Atom [4.])) in
    assert ("3 + - 4" = show minus_one);
    assert ([-1.] = eval minus_one);
    (* left operands of binary ops: unary ops *)
    assert ("(- 4) + 3" = show (Binary (Unary("-", Atom [4.]), 
		                                    "+", Atom [3.])));
    (* no precedence given to unary ops *)
    assert ("- 3 + 4" = show (Unary ("-", seven)));
    assert ([-7.] = eval (Unary ("-", seven)));

    (* unary ops: identity, sum, product *)
    assert ([7.] = eval (Unary ("+", seven)));
    assert ([202.] = eval (Unary ("+/", Atom [49.; 50.; 51.; 52.])));
    assert ([0.] = eval (Unary ("+/", Atom [])));
    assert ("+/ ()" = show (Unary ("+/", Atom [])));
    assert ([6497400.] = eval (Unary ("*/", Atom [49.; 50.; 51.; 52.])));
    assert ([1.] = eval (Unary ("*/", Atom [])));
    (* simple unary ops on vectors *)
    let minus_nums = Unary ("-", Atom [3.; 4.]) in
    assert ([-3.; -4.] = eval minus_nums);
    assert ("- 3 4" = show minus_nums);
    assert ([-3.; -4.] = eval (Unary ("+", minus_nums)));
    assert ("+ - 3 4" = show (Unary ("+", minus_nums)));

    (* other binary ops *)
    assert([-1.] = eval (Binary (Atom [3.], "-", Atom [4.])));
    assert([12.] = eval (Binary (Atom [3.], "*", Atom [4.])));
    assert([60.5] = eval (Binary (Atom [121.], "/", Atom [2.])));
    assert([2.] = eval (Binary (Atom [122.], "%", Atom [10.])));

    (* pointwise binary ops with vectors *)
    assert([7.; 8.; 9.; 10.] = eval (Binary (Atom [3.], "+", 
                                             Atom [4.; 5.; 6.; 7.])));
    assert([7.; 8.; 9.; 10.] = eval (Binary (Atom [4.; 5.; 6.; 7.], "+",
                                             Atom [3.])));
    assert([3.; 3.; 3.; 4.] = eval (Binary (Atom [7.; 8.; 9.; 10.], "-", 
                                        Atom [4.; 5.; 6.; 6.])));
    try ignore (eval (Binary (Atom [2.; 3.], "+", Atom [5.; 6.; 7.]))); 
               assert false with
        Mismatched_list_lengths (a, b) -> 
            assert (([2.; 3.], [5.; 6.; 7.]) = (a, b));

    (* iota *)
    let iota5 = Unary ("iota", Atom [5.]) in
    assert([0.; 1.; 2.; 3.; 4.] = eval iota5);
    assert("iota 5" = show iota5);
    assert([120.] = eval(Unary("*/", Binary(Atom [1.], "+", iota5))));

    (* show_value *)
    assert("0 1 2 3 4" = show_value (eval iota5));

    (* parsing *)
    assert(Atom [23.] = parse "23");
    assert(Atom [2.;3.] = parse "2 3");
    assert(Unary("+/", Unary("iota", Atom [5.])) = parse "+/iota 5");

    (* total system tests *)
    assert("1 2 0 1 2 0 1 2 0 1" = 
           (show_value (eval (parse "((iota 10) - 5) % 3"))));
    assert("0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1" =
           (show_value (eval (parse "(1 + iota 10) / 10"))));
    assert("15 18 21" = (show_value (eval (parse "3 * 4 5 6 + 1"))));
;;
test () ;;




# Then I wrote a build-script:
#!/bin/sh
set -ve
ocamlc -c toyapl_expr.ml
ocamlyacc toyapl_parser.mly
ocamlc -c toyapl_parser.mli
ocamlc -c toyapl_parser.ml
ocamllex toyapl_lexer.mll
ocamlc -c toyapl_lexer.ml
ocaml toyapl_parser.cmo toyapl_lexer.cmo toyapl.ml  # for regression tests
ocamlc -c toyapl.ml
ocamlc -o toyapl_repl toyapl_parser.cmo toyapl_lexer.cmo toyapl.cmo \
       toyapl_repl.ml




# And a clean script:
#!/bin/sh
set -v
# wow, the build process makes 14 files...
rm *~ *.cmi *.cmo toyapl_lexer.ml toyapl_parser.ml toyapl_parser.mli toyapl_repl




# And a Makefile:
# sorry, I wrote this on a Mac without make (!)
all:
	./build

Thu, 05 Apr 2007

Here I discuss the forms the end-to-end principle, considered as a
computer systems architectural principle, has taken over the years,
how it is reshaping society, and how it is merely an extension of the
fundamental human right to freedom of speech --- and the consequences.

I'm sorry to be sending out such a rough draft.  It probably should be
completely rewritten.

Contents
--------

The End-To-End Principle in Mainframes
The End-To-End Principle in the Internet
Scholarly Writing and the End-To-End Principle
The OSI Layer Cake
Content-Centric Networking
Who Are The Ends?
Manifesto
Footnotes

The End-To-End Principle in Mainframes
--------------------------------------

The end-to-end principle was first stated, as far as I know, in Butler
Lampson's "Hints For Systems Design" paper back in the 1970s.
Basically, as I remember it, Butler's argument is that any kind of
error-checking or error correction has to be done by the original
creator of the information and the final user in order to be
trustworthy.  In Butler's example (again, as I remember it), in
getting a data file from the program that originally created it to the
program that eventually uses it, there are a number of places the file
can be corrupted:
- the operating system routines that accept data from the program may
  have defects that corrupt the data;
- the memory in which those routines store the data before writing it
  to disk may be flaky and corrupt the data;
- the bus that connects the memory to the disk controller may be
  flaky;
- the disk controller may be flaky;
- the wires from the disk controller to the disk may be noisy;
- the magnetic domains on the spinning disk may demagnetize over time;
- and when reading the data back, again we must contend with wires,
  disk controllers, buses, memory, and operating system.

Today we could add Solid Oak Software Cyber Patrol corrupting the data
in transit over a network as another potential defect.

In this example, if the original program adds a CRC to the data file,
the program that uses the file will be able to detect all of these
errors with high probability; while if, say, the disk controller adds
a CRC to each data block on disk, it will cost just as much, but will
detect only a fraction of the errors.  While a chain of overlapping
error-checks, each covering only two or three links of the
communications path, could in principle detect an error at any point
along the path, such a chain is much more complex than a single
end-to-end check, and also much more likely to fail.

So this kind of "hop-by-hop" error detection and correction is useful
"as an optimization", because it can often *recover* from errors more
cheaply --- but you shouldn't depend on it.

Butler wrote his paper in the 1970s, when many expensive computers
offered a lot of "hop-by-hop" error detection --- in their CPUs, in
their memory interfaces, in the buses to their peripherals, and so on.

A few years later, Tandem stole the high-reliability computer market
with a design which, I believe, hewed much more closely to the
end-to-end principle: rather than adding trimodular redundancy and
CRCs and the like to each step in the path, they built their computers
as what we would now call "clusters" of individual computers, which (I
believe) had relatively little redundancy inside each one.

The End-To-End Principle in the Internet
----------------------------------------

Telephone networks are designed for high reliability, hop-by-hop: each
node in the path of your call is specified to keep running 99.999% of
the time, at least in the US.  Telco packet data protocols like X.25
used a similar approach --- each router in an X.25 path acknowledges
each packet it receives from the previous router, so that a lost or
corrupted packet can be retransmitted from the previous router,
instead of from the original sender.  Telco protocols for digital
"isochronous" communications, such as voice data, reserve periodic
timeslots for data transmission --- to make sure that no data is ever
delayed or lost in transit, unless something goes wrong.

The internet protocols have a completely different design.  Routers
are free to drop packets at any time without telling anybody, and they
do, even if nothing is going wrong --- simply due to network
congestion.  Typically they only drop about 0.1% to 5% of the packets,
at least according to mtr, but they do drop packets, all the time.
They drop even more packets when a router goes down.

This allows internet protocols to work correctly over lossy radio
networks, noisy modem lines, and long network paths operated by a
dozen or more independent companies.

The sending and receiving computers are responsible for making sure
nothing gets lost or corrupted, usually using a protocol called TCP.
Each packet of data has a checksum to detect (with some probability)
whether the packet has been corrupted in transit, and a sequence
number so the receiver knows what order the packets go in, whether
it's receiving duplicate packets, and whether it's missing any.

    Radio networks typically do retransmit packets at a hop-by-hop
    level, because they often have packet loss rates above 20% per
    hop, and rarely below 5%.  Amateur radio operators use a variant
    of X.25 called AX.25 for this.  If you're transmitting over a
    string of five noisy radio hops, each of which drops 20% of your
    packets, that's 67% packet loss, and TCP doesn't work very well
    any more.  So while TCP is more *reliable* at recovering from
    packet losses (since it handles routers crashing and being buggy),
    AX.25 is more *efficient* at it.  A side effect of radio-network
    packet retransmission is that sometimes it's the acknowledgement,
    not the data packet, that got lost, so sometimes you get multiple
    copies of a packet if you have an 802.11 link in your path ---
    something you'll almost never see with wireline networks like
    Ethernet.

One nice effect of this is that internet communications costs a lot
less than phone communications, because you can do it with shoddier
equipment and more casual agreements among service providers.

The more important effect of this end-to-end principle is that, since
almost all the smarts of an application like voice over IP or file
transfer or hypertext lives in your computer and the computer you're
talking to, not in the dozen routers in between you, you can make a
new networked application and start running it just by convincing one
person to install it on their computer.  This is why telephone
networks have been promising videophones for 50 years, investing lots
of money in services, and never delivering, but it took one creative
college student a few days to revolutionize the music industry with
Napster.

Scholarly Writing and the End-To-End Principle
----------------------------------------------

In mainstream journalism, for some reason, it seems to be considered
bad form to cite sources, so that, for example, your readers can judge
whether a quote was taken out of context; perhaps you're supposed to
trust the journalist to have investigated adequately and represented
the facts accurately, not argue with them.  As you might expect, this
kind of situation seems to attract the kind of writers who are least
likely to investigate adequately.

By contrast, in academic publishing, there seems to be a norm that no
assertion, however widely accepted, should make it into your paper
without a footnote to tell your readers why you believe it, unless
it's a result of your personal direct observation, in which case
you're expected to say that.  Sometimes this makes for arduous reading
and even more arduous writing, but it has the benefit that the reader
can follow the references and, perhaps, find that the citing author
has made a mistake; this makes it less likely that a simple error will
be propagated from generation to generation.

I have seen it asserted somewhere that this norm is what defines
"scholarly writing".

There's a clear analogy here to end-to-end error-correction in
computer systems.  The reader is expected to read the words written by
the guy who originally asserted something and make up his own mind
what they mean and whether they're well-supported --- not rely on
error-free transmission of that information by a pyramid of
intermediate authors.

A weaker form of this norm has arisen in the blogosphere: if you refer
to some published document, whether a blog post or a government
report, whether to cite it with approval or dispute it, you are quite
strongly expected to provide a link to that document, so that your
readers can make up their own minds.  (You are also expected to link
to the source from which you heard about it.)

The OSI Layer Cake
------------------

Often, computer networks are described in terms of a somewhat bogus
seven-layer cake model developed by OSI, the failed phone-company
attempt to reinvent the internet as something stupid.  The lowest
layers are things like the physical medium, the internetworking packet
protocol that packages stuff up so it can move from physical medium to
physical medium, and the stuff that puts checksums and sequence
numbers on it to make sure it can be retransmitted and corruption
detected, and the highest layers are things like data representation
formats and the "application layer", which is the actual "application"
you're using the network for --- HTTP to implement a distributed
hypertext system, say, or IRC for online chatting, or NTP for setting
your computer clock.

The idea is that each communication starts at the top of the stack,
tunnels its way down through the layers to the bottom of the cake,
getting wrapped in successively more goop as it goes, bumps along the
bottom couple of layers to get to its destination, and then erupts up
through the layers at the destination, losing the goop as it goes.

Different network systems have slightly different actual sets of
layers, which map with more or less fudging onto the seven-layer OSI
model.

To be painfully concrete, and use the real layers that actually exist
on the internet, an HTTP request starts as my browser, say, deciding
it wants to load an image from some URL.  So it packages that request
into an HTTP GET request, which is a blob of text a few dozen words
long, and asks the operating system to make it a connection to the
server.  The operating system [0] sends some DNS packets to random
places around the network, which hopefully tell it the IP address of
the server, and so it sends a TCP SYN packet to that IP address asking
to open a new connection.  To actually send the TCP packet, it puts it
inside an IP packet, and then tries to figure out what to do with the
IP packet, which involves consulting its routing tables and probably
deciding it should send it to the same router it uses for everything
else, which is connected to its Ethernet port, at which point it sends
an ARP request out on the Ethernet port asking for the Ethernet
address of the router.  The router sends back an ARP reply with the
router's Ethernet "MAC" address, and then the computer sends the
router an Ethernet packet or "frame" (by wiggling voltages on the wire
between the computer and the router), within which is an IP packet,
within which is the TCP packet or "segment".  Later on, my computer
will send the router more Ethernet frames containing IP packets
containing TCP segments containing the actual text of the HTTP
request.

At the other end, the router connected to the web server sends a
different Ethernet frame to the web server, containing the same IP
packet containing TCP containing HTTP, and the web server goes through
more or less the same process in reverse: examine the IP packet to see
if it's to the web server or to someone else (and who it's from),
examine the TCP packet to see which connection it's for, hand the HTTP
data to the particular server program handling that connection, and
then figure out what the HTTP data means and what to do about it.

So in this case there are actually only five layers --- HTTP, TCP, IP,
Ethernet MAC, and voltages on a wire, not seven.  Other connections
have more or fewer layers, and you can increase the layer count almost
arbitrarily if you start doing stuff like tunneling IP over HTTP,
which you can do.  So those are some of the reasons why the
seven-layer model is pretty bogus, but that doesn't make it useless
for discussion.

Anyway, the "application layer" protocol like HTTP is often called
"layer 7", since it corresponds to the seventh (or maybe sixth and
seventh) layers in the OSI model.  But humorous discussions of network
architecture often mention layers 8 and 9: layer 8 is the guy sitting
at the keyboard clicking the link, and layer 9 is the organization he
works for.

Content-Centric Networking
--------------------------

Van Jacobson is a guy who's famous for developing a lot of the stuff
that keeps the current internet running.  He's recently been working
on some stuff he calls "content-centric networking", on the thesis
that in, say, HTTP, the "ends" being connected are not two computer
application programs, like a web browser and a web server --- they are
a web browser and a named piece of information like an image.  The web
server is often merely a passive conduit through which that
information passes.

He points out that a big part of the difference between the internet
and the telephone network is that the telephone network is basically
providing the service of connecting wires together, so that if you
wiggle the voltage level on one wire, the voltage level on the other
wire wiggles too; while all the internet and computer networking work
is layers on top of that.  And it turned out that it's a lot easier to
route data around at that higher level instead of hooking up a shared
wire with whoever you wanted to talk to.

So the internet was designed, basically, to provide remote-login and
voice transmission services between a guy and a computer, or between
two guys: making a connection.  And all of our security and
performance stuff has been about how to make those connections more
secure and faster and more reliable.

But the web server, which is the ultimate endpoint of the connection
from the HTTP/TCP/IP point of view, is really just another kind of
conduit, and a heck of a lot of the current usage of TCP/IP is for
this kind of thing.  So maybe we should be focusing on talking to the
information instead, which is what Freenet and BitTorrent do, not to
the server.

Who Are The Ends?
-----------------

I talk to people over AIM and Yahoo Messenger all the time.  They both
run on top of a "reliable" protocol, TCP, which is supposed to keep
data from getting lost or corrupted.  But I frequently have trouble
with data getting lost or corrupted with these protocols, for several
reasons:
- Their laptop runs out of battery.
- I lose my wireless connection for too long to keep the TCP
  connection alive.
- I'm using a locutorio computer with Cyber Patrol installed, and it
  cuts the connection because the other person said, "fuck".
- My IM client crashes.
- I send a message that gets eaten by the buggy IM server that relays
  data between us.
- I don't speak Spanish very well, and they're speaking Spanish.

So I use end-to-end error-detection --- I try to remain alert to signs
of miscommunication and communication loss in the conversation, and
"retransmit" by repeating myself or explaining things in a different
way.

Likewise, even the image on the web server is not really the endpoint
of my communication --- it's really the person who put it there, maybe
the owner of the server or maybe not.  The server, in this case, is
merely a medium for communication among people.  [1]

Generally, in all of these examples, computer programs and computer
data are merely proxies for people who want to communicate.

People are the ends.

PEOPLE ARE THE ENDS!

Manifesto
---------

We have begun to transform our society with the power of
communications networks based on the end-to-end principle, as it
changes us in return.  Fundamentally, the power of these
communications networks is the power of free speech or free
communication, which enables the independent investigation of truth.
Free communication is the ability to communicate with anyone who wants
to communicate with you, about anything, as much as you want, in
private or public, as you choose.  End-to-end computer networks make
this ability not only legal, but practical and widespread.

In a world of free communication, abusive dictatorships cannot stand;
human suffering cannot be concealed; information arbitrage in markets
becomes marginal; and political propaganda becomes obvious for what it
is, and only those who seek it out will be deceived by it.

Furthermore, the world of free communication decimates the barriers to
knowledge that keep so much of the human race in poverty.

Finally, this new practical ability to exercise our fundamental right
of free speech has given rise to new ways of developing knowledge and
organizing work: ranging from Wikipedia and the free-software movement
(producing fruits like the Firefox web browser and the Linux kernel)
to eBay, Google, and the blogosphere.

Our world already abounds with the fruits of these changes.

Given these blessings, only in the most extreme cases, such as
preventing imprisoned convicts from plotting further crimes, can
interference with this right of free communication be tolerated.

In particular, this means that the following are intolerable:
- restrictions on the use, ownership, importation, or manufacture of
  particular equipment for communications, particularly restrictions
  intended to prevent people from having access to certain
  information, such as legal enforcement of Digital Restrictions
  Management schemes;
- restrictions on the permissible content of public communications;
- restrictions on the permissible identities of public speakers, or
  requirements that these speakers not be anonymous;
- schemes to prevent anonymous speech and publication, such as
  requiring communications technology to attach traceable serial
  numbers to each communication;
- schemes to prevent private communication, such as prohibitions or
  restrictions on access to encryption hardware or software;
- "takedown notice" mechanisms that enable purported copyright holders
  to exercise prior restraint over the speech of others, particularly
  without prior judicial review.

Furthermore, these are intolerable not just when they are imposed by
the state, but also when they are imposed by private actors with
sufficient power to make them difficult to escape --- for example, by
a telecommunications company.

Footnotes
---------

[0] Actually, this is usually a library which talks to a server at my
ISP, which does the actual work of spewing DNS packets all over the
internet for me.

[1] There are exceptions, like network monitoring reports and computer
simulations, but nearly all the communication over the web is from
people to people, not between people and computer programs.

* Kragen Javier Sitaker <kragen at pobox.com> [2007-04-02 09:40]:
> However, there are several cases where Python's data structures
> are fairly strict in ways that help catch bugs, but which are
> irritating coming from Perl or JavaScript, and which make your
> programs bigger:
> - There's no implicit conversion between strings and numbers, or
>   actually strings and anything.

Note there is interplay here with another issue. Javascript does
two things: overloaded operators *and* implicit conversions. That
easily leads to surprises. F.ex., consider the following:

    function add( a, b ) { return a + b }

This will return a number if both operands are numbers, but a
string if either is a string. If you want to write a reliably
numeric addition, you have to go through contortions:

    function add( a, b ) { return parseInt( a ) + parseInt( b ) }

If you want to reliably “add” (ie. concatenate) strings, it’s not
quite so bad:

    function concat( a, b ) { return "" + a + b }

I consider this botched. Many programmers are not even aware of
these subtleties and those who are will frequently skip the
tedious work of making the code robust.

Python avoids this trap by outlawing implicit conversions.

Perl goes in the other direction: it does implicit conversion,
but not overloaded operators. String concatenation is spelled
`.` whereas numeric addition is spelled `+`. Operands are
implicitly coerced into the type expected by the operator, but
the type of operation is always explicit.

Compare with Python; consider the following function:

    def add( a, b ): return a + b

You can call this either with two numbers, in which case you get
a number back out, or with two strings, which gives you back a
string. You just can’t mix strings and numbers like you can in
Javascript.

So really, Javascript and Perl share one characteristic, but if
you look at the trio from another angle, Javascript and Python
are alike and Perl is distinct.

In fact, Perl 6 goes even further: in Perl 5, the bitwise
operators are overloaded and exhibit different behaviour when
operating on strings vs numbers; as well, the `x` operator is
either string-repeat or list-repeat depending on whether the left
operand is surrounded by parentheses or not. Perl 6 splits all of
these out to separate operators that have exactly one meaning.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Wed, 04 Apr 2007

>                                 The iota operator is fairly useless (I
> think) applied to non-scalars, but it is defined so that its useful
> scalar behavior is just a special case of its more general behavior
> when fed vectors of any length ...

I can't think of definite examples off the top of my head, so the vector  
behavior isn't that useful, but it is useful -- I distinctly remember  
writing code (more than once) that's ravel'ed and reshape'd to produce the  
same effect:

     i. 3
0 1 2
     i. 2 3
0 1 2
3 4 5

(and Numeric's indices() generates an isomorph of the same information)

> Along the same lines, APL's representation of booleans as 0 and 1
> allows you to use +/ to count the number of times something is true...

It seems fitting that the eponym of the "Iverson Bracket" would allow  
mathmatical manipulation of booleans.

What I find more impressive is that APL (judging from the complaints of  
people moving to modern variants which, not being as memory limited, don't  
go to this extreme) would actually store large boolean arrays, not as 0  
and 1 words, but bitwise.

> These aspects of its design are designed to remove redundancy from
> your program, making it shorter and more likely to mean something when
> you screw it up, so it gives the wrong answer rather than simply raise
> an exception.  In general, I do not like this tendency to err quietly.

Part of this may lie in how it was used.  I've been going through  
"Notation as a Tool of Thought", and in general, Iverson builds up his  
functionality in the mathematicians' style: as a sequence of short  
definitions.  Often, after defining a function, he demonstrates how it  
works on sample data -- a very informal variant on unit testing.  So I'd  
expect that the redudancy wouldn't be a big issue if one were to use the  
language as he did; it would be discovered fairly quickly that, although  
the line you'd just typed meant something, it didn't mean what you wanted.

(something else to consider is that he was presumably originally working  
on paper TTYs, as witnessed by the overstruck characters.  When one works  
at a few hundred baud, one makes different tradeoffs on the terse/verbose  
spectrum than when one has multi-megapixel displays.  When one lacks  
full-screen editors, it almost forces coding in very small steps, and  
testing after every step is an easy habit to acquire, especially should  
one be subject to the occasional line-noise*)

-Dave

:: :: ::

* For our younger readers:
I read a blog entry recently where someone pointed out how obsolete the  
RJ-11 modem port on his laptop was.  In those days, modems didn't even  
plug directly into the phone lines.  Everything up to the handset was Bell  
property and not to be tampered with, so you'd take a plain voice  
telephone, then mash the handset into little rubber cups on the modem,  
with actual air pressure waves mediating the data transfer.

see a teletype (with holes on the upper right where its modem was):
http://www.pdp8.net/asr33/pics/med/old_front.jpg

and an acoustic coupler (the modem here is from a decade later, so it's  
probably smaller and three times as fast as what was in the ASR33):
http://williambader.com/museum/modem300/13modem300baudinsideclose.jpg

Mon, 02 Apr 2007

(When reading this, please keep in mind I've never written an
interesting program in APL or PHP, so I am speaking from ignorance.)

APL goes far out of its way to generalize the meanings of things.  For
example, it included a factorial operator (missing in dialects like A+
and K), but rather than give an error when fed a floating-point
number, it would give the value of the gamma function at the point
offset by 1 from the argument --- so it coincides with factorial for
integers, but is more general.  The iota operator is fairly useless (I
think) applied to non-scalars, but it is defined so that its useful
scalar behavior is just a special case of its more general behavior
when fed vectors of any length, pretending that the scalar is a vector
of length 1.

Along the same lines, APL's representation of booleans as 0 and 1
allows you to use +/ to count the number of times something is true;
in K, the 'and' and 'or' functions have additionally been dropped in
favor of the dyadic floor and ceiling functions, which act as 'and'
and 'or' over the domain of 0 and 1.

And then there's its infamous use of every built-in function character
to denote two different functions, depending on whether it is applied
to two arguments or one.  Some characters have even more meanings; I
think / has three (one as the "reduce" operator), and in K I think it
has four.

These aspects of its design are designed to remove redundancy from
your program, making it shorter and more likely to mean something when
you screw it up, so it gives the wrong answer rather than simply raise
an exception.  In general, I do not like this tendency to err quietly.

I was reminded of this language strategy of nonredundancy while
reading the documentation for ocamllex.  

Ocamllex is a lexer generator similar to Mike Lesk's lex, and in the
regular expression patterns that define tokens, you can bind a
variable to a part of the pattern, and there are some rules that
define what the type of the variable is --- whether it's a char, a
string, a char option, or a string option.  (This being Ocaml, you can
do different things with a string, a char, and a string option, so
it's important to know which one you have if you want your program to
compile.)

This reminded me of the big language difference between Python and
both Smalltalk and OCaml, which is that Python has relatively few
kinds of collections, while Smalltalk and OCaml have lots.  Python
doesn't even have a char type; it just uses strings of length 1.
Rather than having a fixed-size array type, a variable-size
OrderedCollection type, a linked-list type, and a special type for
fixed-size arrays of small integers, Python has a single resizable
array type that it calls "list".  (It also has an immutable tuple
type.) Rather than having a red-black tree type, a hash type, an alist
type, a compact type for many different dictionaries sharing the same
set of keys, a skip-list type, and so on, it has a single dict type,
which it implements with hashes.  (However, like Smalltalk, the
interfaces to these collection types are accessed by sending messages
OO-style, so you can make your own collection type.)

I like this a lot, even though it is one factor in making Python
dramatically slower than Smalltalk.  It makes it a lot easier to get
started with the language, and the speed penalty is often acceptable,
and it makes programs shorter.

(In cases where efficiency is really important, there are extension
libraries that provide other kinds of containers, some of which are
even in the standard distribution; but they are used many times less
frequently than lists.)

This struck me as being somehow analogous to the APL approach; where
APL uses one operator to mean many different (but related) things,
Python uses one data structure to represent many different (but
similar) kinds of collections.  It even uses the same indexing
operator to index into dicts and lists.  PHP, JavaScript, and Lua take
this approach even further, in slightly different directions.

So why do I like Python's data structures being general in this way
and dislike APL's operators being similarly general?  I think it's
because I don't expect Python's data structure overloading to hide
bugs, but only to cost performance, while APL's operator overloading
definitely does hide bugs.

However, there are several cases where Python's data structures are
         fairly strict in ways that help catch bugs, but which are irritating
coming from Perl or JavaScript, and which make your programs bigger:
- There's no implicit conversion between strings and numbers, or
  actually strings and anything.
- Trying to access off the end of an array gives you an error rather
  than returning null or resizing the array.  There are explicit
  'append' and 'extend' methods that make the array bigger.
- Dicts are different from lists, so indexing into a list with a
  number, or trying to append to a dict, gives you an error.  (This
  also helps with efficiency; I vaguely recall that Python's
  predecessor ABC used a single painfully-slow AVL tree for a single
  data structure that served both purposes.)
- Dicts are also different from objects, so accessing an object
  attribute with a run-time name requires hairy syntax.  (Also, this
  namespace separation provides more flexibility for emulating dicts
  with user-defined objects; because Perl doesn't do this, it requires
  a much hairier approach to doing this (called "tie"), which
  I consequently use much less.)

You could imagine that Smalltalk's approach of having different kinds
of collections for, say, fixed and variable-size arrays, might also
catch some kinds of errors, in addition to improving speed; but I
don't think it does.