Mon, 28 Nov 2005

(3800 words, terribly sorry)

Lots of web pages are computed dynamically from some set of resources.
For example, a blog's front page is computed dynamically from a
(potentially large) set of blog posts.  The display of an RSS aggregator,
such as the LiveJournal friends page, is computed dynamically from a
subscription list and a bunch of friends' blogs.  Many web pages use some
kind of templating system, so the page consists at least of "content"
and the template.

Performance problems in web page compositing
--------------------------------------------

These applications have a set of performance problems.  Retrieving the
resources they depend on can take time, use bandwidth, and cost other
people money; dynamically generating the web pages themselves uses CPU
time; and all of this often happens while a human being is impatiently
waiting for a response.

Dependency-directed software builds: make
-----------------------------------------

Normally people write software as a human-readable "source code file"
which gets "compiled" (by a slow program) into an "object file" containing
"machine code", which then gets "linked" with other object files (by a
fast program) into an "executable" which they can run.  When they make
changes to the source code, they often want to see the results of their
changes, and they have to rerun this whole process.  But typically, they
haven't changed all the source files, so they only have to recompile
some of the source files.

There's a widely used program called "make" that figures out which
object files are out of date and need to be recompiled from source.
We write a "Makefile" which gives the command to build each object
file, and the command to link the final executable, along with a list
of source files that go into that object file (or object files that go
into that executable.)  If any of these "dependencies" are newer than the
"target", then "make" reruns the command, which "rebuilds" the "target"
from the data in the "dependencies".

Specifying these dependencies is error-prone, and if something fails to
get rebuilt when it should, it often results in hard-to-debug problems.

Interposition in make, with Vesta
---------------------------------

Vesta (http://www.vestasys.org/) is a free-software system for, among
other things, replacing make.  It uses "filesystem encapsulation" to make
sure it has dependencies correct; its approach to tracking dependencies
(http://www.vestasys.org/doc/sdl-ref/walkthrough.html#RunTool) is to
run the compiler (and linker and whatnot) in a chroot in which the only
filesystem available is one provided by Vesta itself.  This means that
it can detect every file the compiler reads, and they're all checked
into its version-tracking repository, so it can reliably tell whether
rerunning a build step would produce the same results as before ---
in which case, it can safely use the previous results.

(Vesta's replacement for make isn't compatible with Makefiles, using its
own syntax and even terminology, and it only works if you store your
files in Vesta's version-tracking repository.  I suspect these may be
reasons Vesta isn't more widely used.  Perhaps someone should create a
version of make that works the same way.)

Interposition in web-page compositing
-------------------------------------

It seems to me that you could use the Vesta strategy with a web-page
compositing system, integrated into a caching proxy HTTP server.
If all the inputs to your web-page generating script are fetched by HTTP
through this proxy, then the proxy can know what they are, and serve
cached output if they haven't changed.  It can measure the CPU cost of
running each script, and only cache output when the time saved is worth
the space used.  It could even run some scripts periodically (if they
take a while to run, and their results are likely to be used) in order
to ensure that their results are available promptly when needed.

You'd have to make sure that you didn't forget any inputs, though --- if you
used any inputs that weren't handled automatically (like the filesystem,
perhaps, or the system clock) you'd want to have a way to handle those
out-of-band.

Common components
-----------------

A caching proxy caches HTTP results at the level of entire
resources (well, their representations).  If you write a
script that depends on eBay's "Toshiba Laptops, 500-666MHz" page
(http://computers.listings.ebay.com/Toshiba_500-666MHz_W0QQfromZR4QQsacatZ31561QQsocmdZListingItemList),
in order to inform you when a new Tecra 8100 is going on sale, you
might not want to rerun your script every time the banner ad on that
page changes.  Instead, you could split it into two parts: one that
executes some query on that page (maybe "<tr> containing "8100" and not
containing <tr>") and makes the results available at some other URL,
and a heavier-weight script that uses the stripped-down page at this
other URL.  Then your heavier-weight script would only run when the part
of the page it cared about had changed.

This lightweight part-extractor component will be useful in many
applications, since often you only care about a part of the pages you
depend on, and it can make this fact visible to the execution environment.

An analogous component is one that combines several resources into one,
perhaps so that you can extract parts from the combination more easily,
or perhaps so that you can display the combination itself as your
"portal" page.  By separating out this "compositing" or "templating"
function as a separate task, you allow the various pieces to be recomputed
(and possibly cached) independently.

A third common component, which might subsume the task of the first two,
is an XSLT transformer.

The Atom format, which is normally used for sequences of updates, has
several potentially interesting components that can live within this
framework.  Specifically, it has a "truncator" that drops all but the
most recent N events from a given feed; an "aggregator" that takes
several Atom feeds and interleaves their posts according to time; and
an "encapsulator" or "watcher" that takes an entire web page and
encapsulates it as a single Atom item in a one-item-long feed.

Other features of this execution environment
--------------------------------------------

In addition to network access, input caching, output caching, CPU
measurement, and speculative pre-caching (this last I haven't
discussed), the following features would fit in nicely:
- special handling of Cache-control: max-age (and Age: of course.).  Scripts
  inside the system could use this to prevent the proxy from polling an origin
  server too often, and normally it would propagate from a request for a
  composited page to the requests used to fetch the components of that page.
- invalidation propagation.  Speculative pre-caching, if it produces a new
  representation for an existing resource, may invalidate other cached
  representations that were computed from that one.  (HTTP might not consider
  those representations "stale", but they no longer reflect the current state
  of the resources they depend on.)  The execution environment should propagate
  the cache invalidation.  In some cases, it may be wise to speculatively
  recompute those other resources as well.  (All of this might also occur while
  a request is being processed, but it would be wise to defer any speculative
  recomputation until the response is finished.)
- special HTTP error handling.  In many cases, it's better to get an
  out-of-date cached representation for a resource than an error
  message, and any caching proxy would be in a position to provide this
  benefit (although I don't know of any that do).  In other cases, you
  might want to propagate e.g. 500 and 404 errors through automatically,
  rather than having special-case code in your service that handles
  this.  (Perhaps that's better to put in a library rather than an
  execution framework, though.)

The choice of which representations to cache cannot be solved locally, since
the benefit of caching something depends on the cost of computing it.
Something that makes sense to cache because it depends on some uncached and
expensive computation might cease to make sense to cache once that uncached
computation gets cached itself.  It turns out that this is related to
well-known problems in OLAP.  See J. Ullman, "Efficient Implementation of Data
Cubes via Materialized Views", http://dbpubs.stanford.edu/pub/1996-32.

Given limited resources, the optimal frequency to speculatively recompute
something (whether it's as simple as a cached copy of a web page, or
something much more CPU-intensive) does not increase monotonically as
the update rate of its dependents increases.  At some point, it is no
longer worth it to re-fetch web pages more often in hope that they will be
up-to-date when they are needed.  See Arasu, Cho, Garcia-Molina, Paepcke,
Raghavan, "Searching the Web", http://dbpubs.stanford.edu/pub/2000-37.

Other systems complementary to this dependency-driven execution environment
---------------------------------------------------------------------------

Without a protocol for pushing cache invalidations across the network,
the above-described cache-invalidation protocol only works inside a
single server.  Given a cache-invalidation protocol for HTTP resources
(for example, the one suggested in my zLab note, "Synchronizing REST
Resources With Asynchronous Notification: A Practical Proposal", which
presently resides at
http://wiki.commerce.net/wiki/Synchronizing_REST_Resources_With_Asynchronous_Notification:_A_Practical_Proposal),
it is possible to prevent any composited page from being out of date
except when the network is unreliable or under conditions of excessive
load.

Such a protocol could be added to existing web sites with a program we
at CommerceNet have been calling a "delta proxy": a reverse proxy that
sits in front of your web server and notices when resources'
representations change.  It is then in a position to provide various
kinds of change notifications, such as a Google "sitemap" file, RSS and
Atom feeds, or pushed cache invalidations.

With Ajax, you can actually push the update notification all the way
to the web browser, updating information on somebody's screen in real
time.  (The current suite of Ajax tools don't support asynchronous
event delivery to the browser --- they don't hold any connection open
to the server --- but as we showed at KnowNow, it is possible to do it
with all modern browsers with some difficulty.)

Many applications you might want to build would require some
historical information --- for example, if you have a web page that
shows the current temperature in Portland, Oregon, you might want to
graph it over time.

Given the previously-mentioned "watcher" component that encapsulates a
web page as an Atom feed, there's one additional component that
doesn't quite fit into this framework that can enable such
applications.  It's a "feed extender", which turns a shorter Atom feed
(such as a one-item feed) into a longer feed, with up to N items, by
remembering items it saw in the past and inserting "new" items at the
beginning of the feed.

There's also the question of how to avoid scaling problems in
publishing of resources like this.  I think we can build a
decentralized network that gives you a few orders of magnitude
improvement here; I've written about this previously at
http://wiki.commerce.net/wiki/Point_to_Point_Multicast_Atom_Network
If you supplement the system described there with cryptographic
signatures with serial numbers, with key hashes included in the URL, you
can avoid the necessity for anyone to hit the origin server, so there
need not be a publicly-accessible origin server any more.

Internal State as Cached Computations on I/O History
----------------------------------------------------

Several formalisms describe a program as a function from an input
history to an output action; my own "rumor-oriented programming" idea
<http://lists.canonical.org/pipermail/kragen-tol/2004-January/000749.html>
depends on this.  If you map your input history into URL space as REST
resources, then your program can be entirely stateless --- all its state
can live out in the web, with this caching proxy to save it from taking
a long time to do anything at all.

Of course, you don't want *any* change to the input history to
invalidate *all* your previous computations.  So you structure your
program in parts.  Suppose, for concreteness, you have a blogging
application whose input history consists of HTTP form POSTs to some URL.
You have one resource that lists all the past form POSTs as links, in
order, and one resource for each form POST.  Your blog-front-page
program looks at this form POST index and issues a request for each
group of ten form POSTs, containing their URLs, to a summarization
script, which returns the dates of the blog posts contained in those
form POSTs.  Then the front-page program decides which ten or fifteen
form POSTs are relevant to displaying the front page, and fetches them
directly.  Although the front-page program's cached results are
invalidated whenever a new POST is made, the summarization script's
results are not, and most of the computation (trawling through massive
amounts of text and deciding which posts are authorized and which
aren't, for instance) is in the summarization script.

In general you can turn a computation on your entire input history into
more-cacheable computations on subsets of it in this way.  It may be
difficult to fit the specifications for these computations into a URL,
though.

Some apps depend on the past representations of some other
URL-addressable resource instead of explicit user actions.  These apps
can be wrapped up into this framework with the Atom "watcher" and "feed
extender" components discussed above.

Sample app outline: eBay querier
--------------------------------

Suppose you want to follow the current market price of Toshiba Tecra
8100s on eBay over time.  eBay used to allow easy searching of
recently-completed auctions, which gave you an easy way to get the
current market price: look for successfully-completed auctions of recent
similar items and use the final selling price.  "Buy it now!" added a
confounding factor, since it allowed a bad seller to depress the
apparent market price, but you could ignore the "Buy it now!" hits.

But first they moved the "search completed items" option off the main
search page onto search-completed.ebay.com.  Now even that requires that
you log into your eBay account, so they can track who's searching for
what.

My approach now is to do a current search, find the non-buy-it-now
auctions ending soonest, bookmark the links to those auctions, and
follow those links after the auctions are scheduled to end.  This is a
little laborious, and it would be nice to automate.

So you establish a "watcher" and a "feed extender" with a polling
interval of, say, an hour, on some relevant eBay category page.  Then
you write a script which looks at the "extended feed" and, given a
search string, finds items in a recent version of the page matching the
string that are not there in the current version of the page.  Then it
fetches the items' individual auction pages and extracts the number of
bids and the closing price, and returns a table of these with the item
descriptions and links.

Sample app outline: webmail viewer
----------------------------------

How could you build an application like Squirrelmail or Hotmail in this
framework?  (This relates to the application area I had in mind in
"rumor-oriented programming".)

(This is somewhat oriented to my own style of email usage, in which all
the email I have ever received is a database of useful facts and
ideas, but I rarely read new email often than twice a week.  Apparently
for most people it's a sort of reliable work queuing system, which they
use as a sort of to-do list, and delete items once they're done.)

You map the history of incoming email messages into URL space: perhaps a
web page consisting of links to all the incoming messages for a
particular account for a particular month, and another "master index"
web page listing the total number of messages in each month, with links
to the individual-month pages.  These pages are not accessible to
everyone, perhaps by virtue of having unguessable URLs that never leak
outside the system.

This email-history necessarily lives outside the framework I've been
discussing, since it has internal persistent state.  It might store the
messages in MySQL, with a thin layer of Perl or Python on top to render
incremental indices in the manner described above.

This allows you to display read-only views of recent messages; to
optimize this, you might want "most recent 100 messages", "most recent
1000 messages", and "most recent 10000 messages" resources that can be
cached internally, computed from the master index and some number of
individual month pages; and these summaries can include information
commonly used for search, such as From, To, Subject, and References.

If you want full-text search, you can compute inverted-index resources
per month.  (There's no point in creating per-year inverted indices,
because they would depend on the master index, and so they would become
invalid any time you received a new mail message.)  You might cache a
merged index for all months except the current one, by encoding the list
of months in the URL, and use that index (usually cached) plus the
current-month index (fast to recompute) for full-text searches.

Of course, you'd probably prefer not to have the full-text searcher
"download" the entire inverted index for every search, even from a local
cache.  But you can map arbitrarily small parts of it into URL space:
words beginning with 'i', words beginning with 'in', words beginning
with 'ind', etc.  Of course, the thing that extracts that small part
must "download" the entire inverted index, but not very often.

To provide deletion and filing in different folders, you need to store a
history of user actions, i.e. form posts.  Then you can compute
resources such as "deletions from folder INBOX", "deletions from folder
Work", "filings into folder Work", etc., from this history of user
actions, and the display of the folder Work might depend on the "filings
into folder Work" and "deletions from folder Work" resources, plus some
random collection of incoming message resources.  

Suppose you want to subdivide "deletions from folder INBOX" into smaller
subsets that pertain to individual months of incoming messages, so that
you can look at lists of messages you received (but didn't delete) last
year, and not have them take a long time to display.  So you have a
"deletions from folder INBOX of messages originally received in
November 2003" resource, which depends on "deletions from folder INBOX"
as well as "messages received in November 2003".  But "deletions from
folder INBOX" changes fairly often, and whenever it does, it invalidates
"deletions from folder INBOX, November 2003".  Recalculating that
resource's representation might show that it didn't change, so there's
no need to recalculate other things that depend on it ("display of INBOX
messages received in November 2003").  But, since "deletions from INBOX
Nov 2003" involves reading all of "deletions from INBOX", it's probably
more expensive to recompute anyway.

Of course, user actions that have some effect outside this system, such
as sending email to someone outside the system, don't fit inside this
framework; you need some external "gateway" program for that.

Sample app outline: custom portal
---------------------------------

You want a web page that tells you the top news stories, the prices on
your favorite stocks, the current value of your stock portfolio,
headlines from BBC News, and your current to-do list items (pulled from
some external system like Tadalist.)

If you have the XPath-extractor service described earlier, you can
describe this almost entirely in terms of a sequence of URLs whose
<body> contents should be concatenated.  (You might need an XSLT
transformer for the RSS feeds.)  Then you can put this sequence of URLs
on a page on your personal web site and pass its URL to the "custom
portal app" running inside this dependency-driven recomputation
environment; it reads your list of URLs and fetches a resource for each
URL, some of which (the XPath-extractor, perhaps) are inside the DDR
system and depend on external URLs.  External URLs are cached and
fetched rarely, and the compositing program doesn't have to think about
any of it.

Complementary infrastructure
----------------------------

I've mentioned a few services above that make this system more useful,
while standing outside the strict dependency-driven evaluation graph;
Here's a list of the things I've thought of:
- the Atom "feed extender" listed above, which provides a window into
  the past;
- the form-POST logger occasionally mentioned above; probably this will
  be most useful if its URL is secret or otherwise not globally
  accessible.
- URL shorteners, along the lines of makeashorterlink.com, provide four
  useful services in this context:
  - if your overall tree of dependencies is large, you may have a hard
    time fitting it into a single URL query string (containing encoded
    URLs containing their own query strings pointing at still other
    URLs).  A URL shortening service can keep the URL lengths under
    control.
  - if you can change the URLs they point to, they serve the function of
    mutable variables.
  - if they proxy for URLs rather than redirecting to them, they can
    provide revocability for access to secret capability URLs.
  - if they can take a query string with named parameters and combine
    that with any original query string, they provide currying.
- a user interface that appreciates that URLs can take other URLs as
  parameters; see
  <http://lists.canonical.org/pipermail/kragen-tol/2005-September/000793.html>
  "Currying as a UI technique for devolution of power to users" for my
  thoughts on that, particularly the scenario entitled "An Example
  Application" and the explanatory text in the section "A REST-backed
  drag-and-drop UI".
- a web-wide invalidation protocol, like that described in the zLab note
  mentioned above.

Holes
-----

The inverted-index case points out that, in some cases, one small
program can compute a large number of resources ("words beginning with
a; words beginning with b; etc.") with no more work than it needs to
compute one of them.

As illustrated by the "INBOX deletions, Nov 2003" example above, because
there's no way to store information inside the system, there's no way to
deal with externally provided resources that are too large in
granularity and change frequently.  You can cache smaller resources that
contain subsets of them, but revalidating those cached subsets involves
rereading the entire large resource.

The idea that a dependent resource might not actually change when the
thing it depends on changes poses a couple of problems:
- it casts doubt on the idea that a cost-based optimization algorithm
  can a priori figure out a good set of things to cache
- it interferes with eager invalidation, where the invalidations flow
  forward along the data flow edges without recomputing any of the
  intermediate values unless the invalidations reach something that's
  currently being observed.  ("eager invalidation" is the lazy
  evaluation scheme described in
  <http://lists.canonical.org/pipermail/kragen-tol/2005-May/000778.html>
  "lazy evaluation in a distributed system", and
  <http://lists.canonical.org/pipermail/kragen-hacks/2005-June/000412.html>
  "lazy evaluation in Python in a way that supports distributed
  operation", and
  <http://lists.canonical.org/pipermail/kragen-hacks/2004-June/000399.html>,
  "dynamically updating dependency networks in Smalltalk".)  I think
  it's possible to reconcile the two schemes by having three states per
  cached resource (valid, possibly invalid, invalid) instead of just
  two, but I haven't done the work to find out.

Credits
-------

The ideas in here were inspired by work by, and were honed in
discussions with, Dan Moniz, Ben Sittler, Rohit Khare, Allan Schiffman,
Donovan Preston, and Mark Lentczner.

Sat, 26 Nov 2005

I was thinking last night about how an election can fail to have a
Condorcet winner, and I wondered what fraction of all possible
elections had that property.

Rather than spending half an hour looking up the answer on Google and
reading it (finding e.g. http://arxiv.org/abs/math.ST/0511140 for
three candidates) I thought it would be fun to spend three hours
hacking up some code that wouldn't actually give me an answer, but
maybe would give me some feel for the answer.  (It turns out that the
vague answer is that having no Condorcet winner is a common
phenomenon.)

So here it is.  It, or some later version, is also online at
http://pobox.com/~kragen/sw/condorcet.html

<html><head><title>Condorcet cycles</title>
<script type="text/javascript">
/* TODO:
 * D create adjustable parameters for voters and candidates
 * D randomly generate voter prefs with button
 * D display voter prefs
 * D display voter prefs in columns of table
 * D come up with candidate names
 * D come up with candidate colors
 * D display candidate/candidate matrix
 * D compute Condorcet winner
 * (at this point I only had 70 minutes left before kragen-hacks goes out)
 * D compute FPTP and Borda count winners
 * (at this point I had 20 minutes left)
 * - compute dominant cycle if no winner
 * - color-code voters on mouseover of candidate comparisons
 * - support drag-and-drop modification of voter prefs?
 * D generate candidate colors randomly?
 */


/* very basic functions needed even for class Candidate */

function map(func, alist) {
    var rv = []
    for (var ii = 0; ii < alist.length; ii++) 
        rv.push(func(alist[ii]))
    return rv
}

function tag(tagname) {
    return function(content) {
        return '<' + tagname + '>' + content + '</' + tagname + '>'
    }
}
function font(color, content) {
    return tag('font color="' + color + '"')(content)
}
function all(alist) {
    for (var ii = 0; ii < alist.length; ii++)
        if (!alist[ii]) 
            return false
    return true
}            
function sum(alist) {
    var rv = 0
    for (var ii = 0; ii < alist.length; ii++)
        rv += alist[ii]
    return rv
}


/* class Candidate */

function Candidate(name, color) { 
    this.name = name 
    this.color = color
}
Candidate.prototype.as_html = function() { 
    return font(this.color, this.name)
}
Candidate.prototype.compare = function(other_candidate, voters) {
    var wins = 0
    var losses = 0
    for (var ii = 0; ii < voters.length; ii++) {
        if (voters[ii].prefers(this, other_candidate)) {
            wins++
        } else {
            losses++
        }
    }
    return [wins, losses]
}
Candidate.prototype.beats = function(other_candidate, voters) {
    var comparison = this.compare(other_candidate, voters)
    return comparison[0] > comparison[1]
}
Candidate.prototype.beats_all = function(candidates, voters) {
    var self = this
    return all(map(function(candidate) {
        return self.beats(candidate, voters)
    }, candidates))
}
Candidate.prototype.fptp_votes = function(voters) {
    var self = this
    return sum(map(function(voter) { return voter.prefs[0] == self ? 1 : 0 },
                   voters))
}
Candidate.prototype.borda_count = function(voters) {
    var self = this
    return sum(map(function(voter) { return voter.borda_count(self) }, voters))
}

/* end class Candidate */


/* very basic functions needed for class Voter */

function identity(obj) { return obj }
function copylist(alist) { return map(identity, alist) }
function randrange(max) { return parseInt(Math.random() * max) }
function method(methodname) {
    // XXX should take extra parameters somewhere...
    return function(obj) { return obj[methodname]() }
}

function Voter(candidates, voterid) {
    this.voterid = voterid
    this.prefs = copylist(candidates)
    for (var ii = 0; ii < this.prefs.length; ii++) {
        var pos = randrange(ii+1)
        var tmp = this.prefs[pos]
        this.prefs[pos] = this.prefs[ii]
        this.prefs[ii] = tmp
    }
}

Voter.prototype.ballot = function() {
    return ['Voter #' + this.voterid].concat(map(method("as_html"), this.prefs))
}

Voter.prototype.prefers = function(a, b) {
    for (var ii = 0; ii < this.prefs.length; ii++) {
        var cand = this.prefs[ii]
        if (cand == a) return true
        if (cand == b) return false
    }
    throw "candidate not in list"
}
Voter.prototype.borda_count = function(candidate) {
    for (var ii = 0; ii < this.prefs.length; ii++)
        if (this.prefs[ii] == candidate) 
            return this.prefs.length - ii
    throw "candidate not in list"
}
            

/* end class Voter */


/* main program routines, with HTML and so forth */

td = tag('td')
tr = tag('tr')
th = tag('th')

function elem(id) { return document.getElementById(id) }


function table_innerHTML(matrix) {
    var rv = []
    for (var ii = 0; ii < matrix.length; ii++) {
        var row = []
        for (var jj = 0; jj < matrix[ii].length; jj++)
            row.push((ii == 0 ? th : td)(matrix[ii][jj]))
        rv.push(tr(row.join('\n')))
    }
    return rv.join('\n')
}
function transpose(matrix) {
    var rv = []
    // XXX does not work for empty matrix
    for (var jj = 0; jj < matrix[0].length; jj++) rv.push([])
    for (var ii = 0; ii < matrix.length; ii++)
        for (var jj = 0; jj < matrix[0].length; jj++)
            rv[jj].push(matrix[ii][jj])
    return rv
}
function display_voters(dest, voters) {
    elem(dest).innerHTML = table_innerHTML(transpose(
        map(method('ballot'), voters)))
}


function candidate_row(candidates, voters) {
    return function(row_candidate) {
        return tr(th(row_candidate.as_html()) + map(
            function(column_candidate) { 
                if (row_candidate == column_candidate) return td('')
                var comp = row_candidate.compare(column_candidate, voters)
                var color = ((comp[0] > comp[1]) ? row_candidate.color :
                             (comp[1] > comp[0]) ? column_candidate.color :
                             'grey')
                return td(font(color, comp[0] + '-' + comp[1]))
            }, candidates).join('') )
    }
}

function compose(f, g) { return function(obj) { return f(g(obj)) } }
function display_pairwise(dest, candidates, voters) {
    var cand_hdr_top = tr(td('') + map(compose(th, method("as_html")), 
                                       candidates).join(''))
    var cand_rows = map(candidate_row(candidates, voters), candidates)
    elem(dest).innerHTML = cand_hdr_top + cand_rows.join('\n')
}


function display_winner(dest, candidates, voters) {
    var winner = "No candidate"
    for (var ii = 0; ii < candidates.length; ii++) {
        var cand = candidates[ii]
        if (cand.beats_all(candidates, voters)) {
            winner = cand.as_html()
        }
    }
    elem(dest).innerHTML = winner + " is the Condorcet winner."
}

function update_winner(current, name, score) {
    if (score > current[1]) {
        current[0] = name
        current[1] = score
    } else if (score == current[1]) {
        if (current[0])  /* initially null */
            current[0] = "tied"
    }
}

function display_other_winners(table, fptp, borda, candidates, voters) {
    var table_headers = [tr(th('Candidate') + th('FPTP') + th('Borda'))]
    var fptp_winner = [null, 0]
    var borda_winner = [null, 0]
    var table_rows = map(function(candidate) {
        var fptp_votes = candidate.fptp_votes(voters)
        update_winner(fptp_winner, candidate.as_html(), fptp_votes)
        var borda_count = candidate.borda_count(voters)
        update_winner(borda_winner, candidate.as_html(), borda_count)
        return tr(th(candidate.as_html()) + td(fptp_votes) + td(borda_count))
    }, candidates)
    elem(table).innerHTML = table_headers.concat(table_rows).join('\n')
    elem(fptp).innerHTML = fptp_winner[0]
    elem(borda).innerHTML = borda_winner[0]
}



function candidate_name(ii) {
    /* http://www.census.gov/genealogy/names/names_files.html */
    var candidate_names = [
        'Margaret', 'Dorothy', 'Charles', 'Richard', 'Maria', 'Thomas', 
        'Jennifer', 'Elizabeth', 'Joseph', 'David', 'James', 'Mary', 
        'Patricia', 'William', 'John', 'Susan', 'Michael', 'Barbara', 
        'Linda', 'Robert',
    ]
    if (ii < candidate_names.length)
        return candidate_names[ii]
    return 'Candidate #' + ii
}

function randhex() {
    return ['0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'][randrange(16)]
}
var candidate_color = function(ii) {
    var colors = [ 'black', 'red', 'blue', 'cyan', 'magenta', 'green',
                  'yellow', 'orange', 'purple', 'bluegreen' ]
    if (ii < colors.length) return colors[ii]
    var r = randhex()
    var g = randhex()
    var b = randhex()        
    return '#' + r + r + g + g + b + b
}

function range(n) {
    var rv = []
    for (var ii = 0; ii < n; ii++) 
        rv.push(ii)
    return rv
}

/* main program entry point */

function election(form) {
    var n_voters = form.voters.value
    var n_candidates = form.candidates.value
    var candidates = map(function(ii) { 
        return new Candidate(candidate_name(ii), candidate_color(ii)) 
    }, range(n_candidates))

    var voters = map(function(ii) {
        return new Voter(candidates, ii)
    }, range(n_voters))
    display_voters('voterlist', voters)
    display_pairwise('matrix', candidates, voters)
    display_winner('winner', candidates, voters)
    display_other_winners('othervotingsystems', 'fptp-winner', 'borda-winner',
                          candidates, voters)
}
</script>
</head>
<body onload="election(document.forms[0])">
<h1>Condorcet winners</h1>
<p>The "Condorcet winner" of an election is the candidate that is
preferred by a majority of the voters over any other single candidate.
The basic idea is that each voter has some order of preference of the
available candidates, and you can tell which candidate they prefer out
of any pair by comparing their ranks in that list.</p>
<p>A well-known problem is that, if there are more than two candidates
and more than four voters, there may not be a Condorcet winner in a
particular election.  The case with three candidates ABC and five voters
is as follows: ABC, ABC, CAB, CAB, BCA.  A is preferred to B by four out
of five; C is preferred to A by three out of five; and B is preferred to
C by three out of five.  Such examples demonstrate that the idea of a
Condorcet winner does not solve the problem of collective choice; it's
an interesting question how probable the existence of a Condorcet winner
is.</p>
<p>So here's a little experimental environment.  You can set the number
of voters and the number of candidates and randomly generate the
preferences for each voter, and see the Condorcet results....</p>
<form onsubmit="election(this); return false">
<label for="voters">Voters: </label><input id="voters" 
    name="voters" value="15" size="4" />
<label for="candidates">Candidates: </label><input id="candidates" 
    name="candidates" value="9" size="4" />
<input type="submit" value="Elect!" />
</form>
<h2>Individual voter ballots:</h2>
<table id="voterlist">
<tr><td>No election yet.</td></tr>
</table>
<h2>Pairwise matchups:</h2>
<table id="matrix">
<tr><td>No election yet.</td></tr>
</table>
<h2>Condorcet winner</h2>
<p>A Condorcet winner will appear in the matrix of pairwise matches
with their entire row and their entire column a single solid color:
their color.</p>
<p id="winner">No election yet.</p>
<h2>Other voting systems</h2>
<p>In first-past-the-post (used, for instance, in the US presidential
election and most other US elections) each voter votes for exactly one
candidate, and the candidate with the most votes wins.  If we assume
"sincere voting" --- each voter votes for their first choice --- we
can deduce the FPTP results from the preference orders displayed
above; the winner is <span id="fptp-winner">not yet computed</span>.
Obviously this is a terrible system, and every other voting system
proposed so far has been proposed because its proponents think it is
better.</p>
<p>In the Borda count, your last choice gets 1 point; your
next-to-last choice gets 2 points; and so forth, until your first
choice gets as many points as there are candidates.  The Borda count
winner of this election is <span id="borda-winner">not yet
computed</span>.
</p>
<table id="othervotingsystems">
<tr><td>No election yet.</td></tr>
</table>
</body></html>

Thu, 24 Nov 2005

(1200 words, sorry)

I was talking with Tantek Celik tonight (2005-11-23) for an hour or two
about folksonomies, why they work, and user-defined data structures.
The ideas in this post came almost entirely from our discussion, but I
don't know which ones he deserves credit for and which ones I deserve
blame for, but he definitely deserves credit for the term "folkschemas".

A couple of new services (Google Base[insert ref], Ning/24HL[insert
ref], JotSpot <http://www.jot.com/>) allow people to put arbitrary sets
or bags of key-value pairs into a big shared store, and then run
arbitrary queries on that store, queries more or less of the type that
Partial Match Indexing [insert ref] optimizes: "field foo has value bar,
and field baz has value quux," etc.  This is sort of the thing RDF was
intended to support [insert ref], but without a single centralized
database --- just a bunch of XML files on the web.  So if a group of
people happen to use the same field name (predicate name, edge label,
relationship, etc.), you can search across all of their data records
with a single query, at least if that data is indexed in some index that
you have access to.

Obviously, this can result in the Tower of Babel problem: it's not at
all certain that different people will in fact use the same field names
when they mean the same thing, and, worse, it's not at all certain that
they will use different field names when they mean different things.

This is also a potential problem with folksonomy systems like del.icio.us:
I may choose 'sf' as my tag for a semantic category that you call
'sanfrancisco' or 'san-francisco', and so searches by tag won't have
very good recall, and searches that exclude by tag won't have very good
precision.  RDF's solution to this, I think, is OWL [insert ref], where
you specify mappings between relationships and other objects you think
are equivalent for the purpose of your query.

The Microformats effort <http://www.microformats.org/> is working to
establish broadly interoperable schemas for the most widely used data
types, such as personal contact information and event announcements,
with defined concrete representations in HTML, semantics, and pragmatic
intent for each format.  Microformats solves the Tower of Babel problem
within the domains where the largest benefit attaches to doing so, using
a lightweight version of the normal standards process: document existing
needs, research existing solutions, propose a solution that spans most
of the existing uses, and seek consensus on the solution.

But what about domains with only a few tens of thousands of currently
published structured data items?  These include horserace odds, used-car
listings, electronic part listings, personal ads, classical music album
information, satellite photos, software bug reports, word definitions
from dictionaries, software source code patches, expense reports,
college course schedules, prayers, and so forth.

In domains like these, even the lightweight Microformats specifications
process may be enough effort to discourage implementors from
participating, much as most people who want to categorize their
snapshots or web bookmarks aren't interested in participating in the
discussions of the Library of Congress or Open Directory topic headings.
It's easy enough for people working in these domains to come up with
some set of field names and stick their data into Google Base or Ning.
Is there some kind of even-lighter-weight standardization process that
might make their schemas more likely to interoperate?

The cases of Flickr and del.icio.us --- the canonical folksonomies ---
suggest that there might.  The systems are perfectly usable without
looking at anyone else's stuff, but people get network-effect benefits
from using tags that others use: people can find their items more
easily, they get better suggestions for tags to apply to their items,
their items are classed with other items that are more closely related,
and so forth.  Gradually, over time, people tend to adopt tags that are
more similar to other people's tags.

Tantek coined the term "folkschemas" to describe potential sets of field
names that would converge in the same gradual way.  So far, we don't
have a lot of examples of this happening in the real world; geotagging
of photos on Flickr, with 94000 current examples, is perhaps the most
prominent example.  If there are only a few examples with wide
applicability, a Microformats-style lightweight standards process might
not be so bad.  It's only if there is a "long tail", with an almost
infinite number of formats popular enough that interoperability matters
but unpopular enough that nobody wants to bother to spend a week or two
on a standard, that "folkschemas" are really a win.

Before del.icio.us and Flickr, there had been a number of attempts at
folksonomy-like systems that had never achieved any noticeable level of
common categorization without a formal standards process: the meta
keywords HTML tag, the Keywords fields of various scientific article
formats, the old McBee and Indecks card systems.  We hypothesized that
the software in these two cases had reached the point where the gentle
incentives to converge on common categorizations were strong enough that
people began to do it, and brainstormed a bit about things that could
enable the same sort of effect for folkschemas.

The "heat map" or "tag cloud", which displays some set of tags with
sizes indicating their relative popularity, is one innovation ---
<http://flickr.com/photos/tags/> and <http://del.icio.us/tag/> [verify
ref] are two examples, but they're found all over the folksonomy web.  A
recent innovation is to look just at the subset of tags that co-occur
with a single other tag [insert ref].  Displaying a "heat map" of field
names, perhaps from the subset of field names that co-occur with the other
fields you've already put into the current record, could facilitate
schema convergence.

Likewise, if you've typed a data value such as "SWM", it could be
helpful to have a "heat map", or just a sorted list, of the field names
that have previously been attached to that value in the records produced
by other people, and if you've typed a field name such as "Current
Mood", it could be helpful to have a "heat map" or sorted list of the
other data values that occurred in that field in existing records.

In a system such as Google Base that supports fielded searches, it would
be helpful to know which fields were commonly specified in searches, and
what values were being searched for --- not just what fields and values
others had specified.

Once you have some data entered with some set of field names, it would
be helpful to find out what other fields were similar in content to the
field names you chose, and what their records looked like; this would
enable you to decide whether or not to establish a rule, for example,
that your "lastname" field should also be published as "surname" --- or
possibly whether or not to simply rename the field to "surname" if
that's what's more commonly used.  "Heat maps" or sorted lists of
alternatives might prove useful in this case as well.

Finally, when you perform a query that selects some set of data records,
it would be nice to be able to get "heat maps" and sorted lists of the
field names and field values used within the set of query results.

Mon, 21 Nov 2005

Suppose you're in the not terribly unusual position of having a bunch
of big files (megabytes at least) on a bunch of unreliable disks.
Every once in a while, one of the disks fails, or user error deletes a
file.  But you would prefer not to lose the data in any of these
files, because it's valuable.  What do you do?

A straightforward and reliable solution is to keep multiple copies of
each file, say two or three, and a message-digest of each file (such
as MD5 or SHA-1) so you can detect silent corruption.  When you notice
a disk has failed or a file is missing, you make another copy of one
of the remaining copies.

This scheme is appealingly simple, but the copies are expensive, which
creates a temptation to skimp on the mirroring in some applications.

ECC
---

An error-correcting code transforms some number of bits N into a
larger number of bits M, and can do the reverse transformation
(usually very quickly) even if you flip or erase some of the M bits.
Typically you can erase up to any M-N bits or flip up to any
ceil((M-N-1)/2) bits.  In many codes, the first N of the M bits are
identical to the original M bits --- you just append M-N bits to the
original bits.  (I think these codes are called RCH codes.)

The simplest such code is the parity code: M=N+1, and you XOR your N
bits together to get your Mth bit.  This can detect single-bit errors,
and if you know which bit is lost, it can correct them, too.

So if you have N files of about the same size, you can read through
them in parallel and generate M-N "ECC files" containing those extra
error-correction bits.  (For concreteness, we could talk about
RS(28,24), a Reed-Solomon code that takes 24 8-bit symbols at a time
and adds 4 more to get a total of 28 8-bit symbols.)  When you get
past the ends of some of the files, you can pretend they have zero
bytes there.

These ECC files contain enough information to replace any one of the
existing files, should it be lost or corrupted.  Depending on which
code you use, you may be able to replace several lost files, as long
as the others aren't lost.  (With RS(28,24), you could lose up to four
of the existing files, as long as you knew they were lost.)

So, generate your ECC files from a group of files that are on separate
disks (ideally on separate machines, in separate racks, in separate
hosting centers, served by separate networks, on separate continents)
and store them on other files on yet other disks (preferably as
separate as possible, as well).  Then you can recover any of the files
if they get lost, at relatively low overhead cost.

Finding the original files
--------------------------

There's a practical difficulty if you have a bunch of ECC files full
of apparently random binary data, though, which is that they're only
useful for data reconstruction if you can find the original files they
were generated from (and the other ECC files as well).  So I suggest
beginning the file with some kind of boilerplate header, saying "This
is an ECC file generated by ECC-file-mangler version 2.4, available at
http://www.example.com/eccfile; see the end of the file for more
details", and appending a list of metadata about the input files to
the end of the file, so that you can figure out which of your zillions
of not-lost files you need to read to reconstruct your lost files.

Under the title "filesystem metadata indexing, yet again", at
http://lists.canonical.org/pipermail/kragen-hacks/2004-January/000383.html
I published a program that will run through your filesystem to compute
various metadata about all your files.  The result looks something
like this:

atime: 1074220014
bytes: 1624
ctime: 1072920796
dev_inode: 769_1460288
fips_180_1_sha: a2e62e2d1bfd0e654eeae0b7976478c418dfcaa4
md5: 5da8553a4f8f3f136853c44e8322529c
mode: 0100664
mtime: 1072920796
name: ../Toshiba%2520Pin%2520Password%2520Reset.htm
uid_gid: 500_500

This contains, among other things, the file's original name, the
number of bytes in it.  Metadata like this appended to the end of the
file would enable you to figure out which among your thousands of big
files the ECC file pertains to, and perhaps what to call the
reconstructed result.

Sat, 19 Nov 2005

This is on the Web at http://pobox.com/~kragen/sw/dhtml-sniki.html.  I
wrote it in May but for some reason, as far as I can tell, I never got
around to telling anyone about it, perhaps because I was too ashamed of
the lameness of the implementation.

Like everything else posted to this mailing list without notices to the
contrary, this is in the public domain.

<html>
<head>
<title>Sniki-like DHTML</title>
<!--

    Darius Bacon wrote this amazingly cool hack called SnikiSniki, a
    Wiki that contained typed links --- rather like RDF triples.  I
    wanted to play with this same idea in DHTML (probably with Ajax
    persistence eventually).

    This implementation is really lame in several ways:
    - My implementation of depth-first search on the "knowledge base" is
      awfully clumsy; I anticipate that in time I will learn to express
      this concept more clearly in code.
    - Despite living in the browser, the user interface is the
      traditional "type text strings into a small box and hit enter"
      that we've all come to know and love over the last few decades.
    - There's no way to get the first few results before chewing up
      large amounts of browser time.

-->
<script src="http://pobox.com/~kragen/sw/jstest-v1.js"></script>
<script>
// -*- c -*-
// simple (yet over-complex) triple-store library; originally in a
// separate file, included here for email

function makebits() {
    var stack = [[]]
    var results = []
    while (stack.length) {
        var next = stack.pop()
        if (next.length == 3) results.push(next)
        else {
            var choices = [1, 0]
            for (var choice in choices) {
                stack.push(next.concat([choices[choice]]))
            }
        }
    }
    return results
}

function split_triple(triple) {
    return triple.split(/\s+/)
}

function is_var(atomname) {
    return atomname.match(/^[A-Z]/)
}

function freevars(triples) {
    var rv = []
    var seen = {}
    for (var ii in triples) {
        var items = triples[ii]
        for (var jj in items) {
            if (is_var(items[jj]) && !seen[items[jj]]) {
                rv.push(items[jj])
                seen[items[jj]] = 1
            }
        }
    }
    return rv
}

function make_object(names, values) {
    var rv = {}
    for (var ii in names) rv[names[ii]] = values[ii]
    return rv
}

function map(cb, list) {
    var rv = []
    for (var ii = 0; ii < list.length; ii++) rv.push(cb(list[ii]))
    return rv
}

function instantiate(triples, instantiations) {
    return map(function(triple) {
        return map(function(atom) {return instantiations[atom] || atom}, triple)
    }, triples)
}

function queryfunction(query_triples) {
    var split_query_triples = map(split_triple, query_triples)
    var variables = freevars(split_query_triples)
    var stack = [[]]
    var rv = []
    while (stack.length) {
        var assignments = stack.pop()
        var state = make_object(variables, assignments)
        var instantiated_triples = instantiate(split_query_triples, state)
        if (this.may_contain(instantiated_triples)) {
            if (assignments.length == variables.length) {
                rv.push(state)
            } else {
                var current_variable = variables[assignments.length]
                var possible_values = this.get_candidate_values(instantiated_triples, current_variable)
                for (var ii in possible_values) {
                    stack.push(assignments.concat([possible_values[ii]]))
                }
            }
        } 
    }
    return rv
}

function triple_match(pattern, triple) {
    var rv = {}
    for (var jj in pattern) 
        if (!is_var(pattern[jj]) && pattern[jj] != triple[jj]) return null
        else rv[pattern[jj]] = triple[jj]
                 // XXX note that this will be overly optimistic for multiple occurrences of the variable!
    return rv
}

function contains(array, item) {
    for (var ii = 0; ii < array.length; ii++)
        if (array[ii] == item) return true
    return false
}                                   

function triplestore() {
    this.triples = []
    this.add = function(triples) {
        for (var x in triples) this.triples.push(split_triple(triples[x]))
    }
    this.query = queryfunction
    this.may_contain = function(triples) {
        for (var ii in triples)
            if (!this.may_contain_triple(triples[ii])) return false
        return true
    }
    this.may_contain_triple = function(triple) {
        for (var ii in this.triples) {
            if (triple_match(triple, this.triples[ii])) return true;
        }
        return false
    }
    this.get_candidate_values = function(triples, variable) {
        // XXX highly duplicative of may_contain!
        var rv = []
        var seen = {}
        for (var ii in triples)
            if (contains(triples[ii], variable))
                for (var jj in this.triples) {
                    var match = triple_match(triples[ii], this.triples[jj])
                    if (match && !seen[match[variable]]) {
                        rv.unshift(match[variable])
                        seen[match[variable]] = 1
                    }
                }
        return rv
    }
}

tests = {
    bits: function() { 
        assert_equal(makebits(), [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
                                  [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]) 
    },

    is_var: function() {
        assert(is_var("A"), "A")
        assert(!is_var("a"), "a")
        assert(is_var("Ax"), "Ax")
        assert(!is_var("xA"), "xA")
        assert(is_var("Z"), "Z")
    },

    freevars: function() {
        assert_equal(freevars([]), [])
        assert_equal(freevars([['a', 'b', 'c']]), [])
        assert_equal(freevars([['a', 'b', 'C']]), ['C'])
        assert_equal(freevars([['A', 'B', 'C']]), ['A', 'B', 'C'])
        assert_equal(freevars([['A', 'A', 'A']]), ['A'])
    },

    make_object: function() {
        assert_equal(make_object(['beatrice', 'kragen', 'mikel'], ['murch', 'sitaker', 'jay']),
            {beatrice: 'murch', kragen: 'sitaker', mikel: 'jay'})
    },

    triplestore: function() {
        var ts = new triplestore()
        ts.add(['a b c'])
        var yes = [{}]
        var no = []
        assert_equal(ts.query([]), yes)
        assert_equal(ts.query(['a b c']), yes)
        assert_equal(ts.query(['a b d']), no)
        assert_equal(ts.query(['a b c', 'a b c']), yes)
        assert_equal(ts.query(['a b c', 'a b d']), no)
        assert_equal(ts.query(['a b d', 'a b d']), no)
        assert_equal(ts.query(['a b d', 'a b c']), no)
        assert_equal(ts.query(['a b X']), [{X: 'c'}])
        assert_equal(ts.query(['a X c']), [{X: 'b'}])
        assert_equal(ts.query(['X b c']), [{X: 'a'}])
        assert_equal(ts.query(['a X d']), no)
        assert_equal(ts.query(['a X Y']), [{X: 'b', Y: 'c'}])
        assert_equal(ts.query(['a X X']), no)
        assert_equal(ts.query(['A B C']), [{A: 'a', B: 'b', C: 'c'}])

        ts.add(['same c c', 'a b e'])
        assert_equal(ts.query(['a b c']), yes)
        assert_equal(ts.query(['a b c', 'a b e']), yes)
        assert_equal(ts.query(['a b e', 'a b c']), yes)
        assert_equal(ts.query(['a b d', 'a b c']), no)
        assert_equal(ts.query(['A B C']), [
            {A: 'a',    B: 'b', C: 'c'},
            {A: 'a',    B: 'b', C: 'e'},
            {A: 'same', B: 'c', C: 'c'},
        ])
        assert_equal(ts.query(['A B C', 'A b c']), [
            {A: 'a', B: 'b', C: 'c'},
            {A: 'a', B: 'b', C: 'e'}])
        assert_equal(ts.query(['a b X']), [{X: 'c'}, {X: 'e'}])
        assert_equal(ts.query(['X Y Y']), [{X: 'same', Y: 'c'}])
        assert_equal(ts.query(['A B C', 'X C Y']), [
            {A: 'a', B: 'b', C: 'c', X: 'same', Y: 'c'},
            {A:"same", B:"c", C:"c", X:"same", Y:"c"}])
    },

    triplestore_bits: function() {
        var ts = new triplestore()
        ts.add(['bit bit 0', 'bit bit 1'])
        assert_equal(ts.query(['bit bit B0', 'bit bit B1', 'bit bit B2']), 
                     [{B0:"0", B1:"0", B2:"0"}, {B0:"0", B1:"0", B2:"1"}, 
                      {B0:"0", B1:"1", B2:"0"}, {B0:"0", B1:"1", B2:"1"},
                      {B0:"1", B1:"0", B2:"0"}, {B0:"1", B1:"0", B2:"1"},
                      {B0:"1", B1:"1", B2:"0"}, {B0:"1", B1:"1", B2:"1"}])
    },
};

</script>
<script>
// DHTML bits for this sample page

function tablerow(contents, type) {
    if (!type) type = 'td'
    var el = document.createElement('tr')
    for (var ii = 0; ii < contents.length; ii++) {
        var td = document.createElement(type)
        td.appendChild(document.createTextNode(contents[ii]))
        el.appendChild(td)
    }
    return el
}

function clear_node(nod) {
    nod.innerHTML = ''
    // Mozilla erroneously caches some layout... fix it
    var old_display = nod.style.display
    nod.style.display = 'none'
    nod.style.display = old_display
}

function perform_query(qel, ts, triples) {
    clear_node(qel)
    var headers = freevars(map(split_triple, triples))
    qel.appendChild(tablerow(headers, 'th'))
    var data = ts.query(triples)
    for (var ii in data)
        qel.appendChild(tablerow(map(function(propname){return data[ii][propname]}, headers)))
}

var ts = new triplestore()

function handle_input(somestuff) {
    var triples = somestuff.split(/,\s*/)
    var free = freevars(map(split_triple, triples))
    if (free.length) {
        perform_query(document.getElementById('querytable'), ts, triples)
    } else {
        // add to triple store
        var tb = document.getElementById('stufftable')
        for (var ii in triples) tb.appendChild(tablerow(split_triple(triples[ii])))
        ts.add(triples)
    }
    return false
}

function set_up_test_stuff() {
    handle_input('beatrice parent aggie, beatrice spouse kragen, kragen spouse beatrice, beatrice parent walter, kragen parent greg')
    handle_input('aggie spouse walter, greg spouse paula')
    var query = 'Person spouse Spouse, Spouse parent In-law, In-law spouse In-law-2'
    var field = document.getElementById('theform').stuff
    field.value = query
    handle_input(query)
    field.focus()
}

</script>
</head>
<body onload="barebones_run_tests();set_up_test_stuff()"><h1>Toy triple store in DHTML</h1>
<p> Inspired by Sniki.</p>
<table id="stufftable">
<tr><th>object</th><th>property</th><th>value</th></tr>
</table>
<table id="querytable">
</table>
<form id="theform" onsubmit="return handle_input(this.stuff.value)">
<input size="80" name="stuff" />
</form>
</body>
</html>

Thu, 17 Nov 2005

Disclaimer
----------

I know basically nothing about OCR, handwriting recognition, speech
recognition, or automatic translation.


Tolerance for Ambiguity Makes Things Easier
-------------------------------------------

Traditionally, the goal of OCR, handwriting recognition, speech
recognition, or automatic translation has been to find the single
correct interpretation of marks on paper or sounds in a signal stream.
Often, though, it's much easier to produce a description of a set of
possible interpretations (perhaps as a finite-state machine or
infinite-order Markov model) which reliably includes the correct
interpretation than it is to reliably choose the correct
interpretation.  When is this ambiguous representation good enough to
be useful?

Downstream disambiguation
-------------------------

Traditional (e.g.) OCR software proceeds as a pipeline of stages:
convert an image into marks; convert the marks into symbols; convert
the symbols into letters; turn the letters into lines of words,
columns, paragraphs, and so forth.  In its simplest form, this works
very badly; one way to remedy this is to pass sets of candidates from
one stage to the next.

Ocrad, for example, has a -x option to list all the candidate
transliterations for each symbols, including certainty levels, so that
perhaps a downstream program such as a dictionary-driven spell-checker
could disambiguate symbols by context.  (Unfortunately, it isn't
actually useful in Ocrad 0.11, because it almost never comes up with
more than one candidate, and the candidate weights aren't useful
either.)

This is not the only possible way to do downstream disambiguation; for
example, you could resume the algorithm upstream (with Icon or Python
generators, or Haskell lazy lists, or Prolog backtracking, or
simulating these by hand) to get it to generate more possibilities, or
you could run through the entire pipeline several times, each time
feeding the output of the next stage into the previous stage as a set
of weights.

T9 text input on cell phones is an example of this technique: the
keyboard produces a finite-state-machine specification of a word
(e.g. the regexp '[tuv][ghi][def]') which is then disambiguated to a
list of the most probable candidates (perhaps 'the', 'tie', 'vid', in
this case) by consulting a dictionary; then the person who typed the
word selects the desired word from the list of words.

In an OCR system, this approach could simplify the proofreading
process

Search
------

Suppose you have an NFA representation of a text generated by an OCR
program.  You can construct an NFA representation of the suffixes of
the text by adding a new state with epsilon-transitions to all the
existing states.  Now, if you have an NFA representation of some other
word (generated by speech or handwriting recognition, for example)
that you want to find in this document, you can simulate the two NFAs
in parallel to find what points in the document might contain it.

I suspect this could be very useful.

Given infinite-order Markov models of the text --- that is, NFAs with
probabilities on the transitions --- you could even include the
likelihood of a match in the inputs to a TF/IDF-like ranking
algorithm, so that less certain matches ranked lower.

Related work
------------

Presumably most work in OCR, neural networks, speech recognition,
handwriting recognition, etc., deals with tolerance of ambiguity, and
I wouldn't be surprised if there were a bunch of work dealing with
uses for ambiguous final products.

The Soundex scheme, now nearly a century old, reduces similar-sounding
words to small equivalence classes by omitting all vowels, lumping
consonants into six consonant classes of similar-sounding letters, and
collapsing runs of the same consonant class down into one; the idea is
to support search by removing most of the random information
introduced into surnames by spelling variation, as well as some of the
signal.  You could view a Soundex heading such as "S420" as an NFA
(/^s[aehiouwy]*(l[aehiouwy]*)+([cgjkqsxz][aehiouwy]*)+$/ I think,
producing words like "slice"), so it's pretty closely related in some
sense.

Tue, 15 Nov 2005

This is one of the stages of the project described at
http://lists.canonical.org/pipermail/kragen-tol/2005-October/000794.html.

This is currently running at http://considerate.murch-sitaker.org:8000/
and mostly seems to work OK technically --- the UI is a different
problem altogether.  New users universally enter the word they're
looking for in the guidewords boxes.  I haven't yet experimented with
ways to solve this user interface problem, but it clearly needs solving.

I'm also planning to use a Google-Maps-like interface to avoid having to
download the entire 750K JPEG to see any part of the full-res page.

I need to figure out how to put the gigabyte of page image data up on
the web.

#!/usr/bin/python
import twisted.web.resource, twisted.internet.reactor, twisted.web.static, sys
import os.path, nevow, twisted.web.error, urllib, random, string, pickle
import urlparse
T = nevow.tags
# To do:
# D put up on web
# D make reduced images <1024 pixels high
# kragen at considerate:/mnt/raid/kragen/media/oed-v$ time
# for x in newenglishdict05murrmiss_jpg/*.jpg; do
#    convert -size 1024x1024 -resize 1024x1024 "$x"
#       thumbnails/"$(basename "$x")";
#    echo $x done;
# done
# D make navigation among these reduced images possible by wrapping them in
#   pages
# D remember the words at the top of each image
# D support navigation by words
# - make much smaller thumbnails somehow --- copying 100 thumbnails to
#   panacea only managed 40KB/s in 3:46, i.e. 226 seconds, or 2.26
#   seconds each.  Perhaps only the top 200 pixels or so, or 16KB each?
# D keep a file that knows the words at the top of each image
# - make a torrent for the images
# - make a streamlined data entry UI for guidewords?
# - remove "if it's in the book" when it's a headword
# - never redirect back to the same page

def flatten(stan): return str(nevow.flat.flatten(stan))

def dict_precedes(a, b):
    def transformed(word):
        word = word.lower()
        return ''.join([letter for letter in word
                        if letter in string.ascii_letters])
    return transformed(a) < transformed(b)

class Book:
    def __init__(self, images, thumbnails, guidewordlog):
        self.images = images
        self.thumbnails = thumbnails
        self.thumbnail_list = os.listdir(self.thumbnails)
        self.guidewords_list = [None] * len(self.thumbnail_list)
        self.guidewordlog = guidewordlog
        self.read_guidewords()
    def read_guidewords(self):
        self.guidewordlog.seek(0)
        try:
            while 1:
                pageno, first, last = pickle.load(self.guidewordlog)
                self.guidewords_list[pageno] = first, last
        except EOFError:
            pass
    def images_resource(self):
        return twisted.web.static.File(self.images)
    def thumbnails_resource(self):
        return twisted.web.static.File(self.thumbnails)
    def viewer_resource(self):
        return ThumbnailNavigator(self)
    def thumbnail_url(self, index):
        return "/thumbnails/" + self.thumbnail_list[index]
    def fullsize_url(self, index):
        return "/images/" + self.thumbnail_list[index]
    def add_guidewords(self, index, first, last):
        self.guidewords_list[index] = first, last
        print "Page", index, "has guidewords", first, "and", last
        pickle.dump((index, first, last), self.guidewordlog)
        self.guidewordlog.flush()
    def get_guidewords(self, index):
        return self.guidewords_list[index]
    def should_be_on_page(self, index, word):
        guidewords = self.get_guidewords(index)
        if not guidewords: return False
        first, last = guidewords
        return (not dict_precedes(word, first) and
                not dict_precedes(last, word))
    def look_for(self, word):
        last_before, first_after = 21, 1298  # XXX for OED vol V
        word = word.lower()
        for ii in xrange(len(self.guidewords_list)):
            guidewords = self.get_guidewords(ii)
            if not guidewords: continue
            if self.should_be_on_page(ii, word): return ii
            first, last = guidewords
            if dict_precedes(last, word) and ii < first_after:
                last_before = ii
            if dict_precedes(word, first) and ii < first_after:
                first_after = ii
        if last_before < first_after - 1:
            return int((last_before + first_after) / 2)
        else:
            # one of these must be wrong?
            return random.choice([last_before, first_after])

class ThumbnailPage(twisted.web.resource.Resource):
    def __init__(self, book, index):
        twisted.web.resource.Resource.__init__(self)
        self.book = book
        self.index = index
    def render_GET(self, req):
        print "Displaying page", self.index
        form = ''
        if req.args.has_key('q'):
            word = req.args['q'][0]
            if word[0].lower() not in 'hijk':
                warning = "(Not in this dictionary, which only covers HIJK.)"
            elif self.book.should_be_on_page(self.index, word):
                warning = "(Should be on this page if it's in this book.)"
            else: warning = ''
            first, last = '', ''
            guidewords = self.book.get_guidewords(self.index)
            if guidewords: first, last = guidewords
            form = T.form(method="POST")[
                T.p["Searching for ", T.b[word], ". ", T.b[warning]],
                "Enter the guide words at the top of the page: ",
                T.input(type="text", name="first_word", value=first),
                T.input(type="text", name="last_word", value=last),
                T.input(type="hidden", name="q", value=word),
                T.input(type="submit", value="Update"),
            ]
        return flatten(
            T.html[T.head[T.title['Page number ', str(self.index)]],
                T.body[
                    form,
                    T.script(type="text/javascript")[
                        "document.forms[0][0].focus()"
                    ],
                    T.a(href=self.page_link(self.index-1, req))["Prev"], ' ',
                    T.a(href=self.book.fullsize_url(self.index))[
                        T.img(src=self.book.thumbnail_url(self.index),
                              align="top")
                    ],
                    ' ', T.a(href=self.page_link(self.index+1, req))["Next"],
                ],
            ])
    def render_POST(self, req):
        if req.args.has_key('first_word'):
            first = req.args['first_word'][0]
            last = req.args['last_word'][0] or first
            self.book.add_guidewords(self.index, first, last)
        base_url = req.prePathURL()
        word = req.args['q'][0]
        recommended_page_number = self.book.look_for(word)
        newurl = str(recommended_page_number) + '?q=' + urllib.quote(word)
        newurl = urlparse.urljoin(base_url, newurl)
        req.redirect(newurl)
        return ''  # who cares about Opera 1.0?
        
    def page_link(self, index, req):
        if req.args.has_key('q'):
            return '%d?q=%s' % (index, urllib.quote(req.args['q'][0]))
        else: return str(index)

class ThumbnailNavigator(twisted.web.resource.Resource):
    def __init__(self, book):
        twisted.web.resource.Resource.__init__(self)
        self.book = book
    def getChild(self, childname, req):
        try: index = int(childname)
        except: return twisted.web.error.NoResource("Misspelled page number.")
        return ThumbnailPage(self.book, index)
    def render_GET(self, req):
        return flatten(
            T.html[T.head[T.title["OED Volume V"]],
                T.body["See ", T.a(href="page/0")["the first page"],
                    ' or search for a word from H to K.  Searching may ',
                    'require that you type in the guidewords you see on ',
                    'up to six pages before you get to the correct page: ',
                    T.form(method='POST', action='page/1')[
                        T.input(name='q', value='hawk')
                    ],
                    T.script(type="text/javascript")[
                        "document.forms[0][0].focus()"
                    ],
                ]])

indexpage = flatten(
    T.html[T.head[T.title["Index to OED volume V site"]],
        T.body[T.h1["Index to OED volume V site"],
            T.ul[
                T.li[T.a(href="page")["My prototype viewer"]],
                T.li[T.a(href="images/")["Archive's index page"]],
                T.li[T.a(href="thumbnails/")["Raw thumbnails dir"]],
            ],
            T.address[T.a(href="mailto:kragen at pobox.com")[
                "Kragen Sitaker"]],
        ],
    ]
)

def ok(a, b): assert a == b, (a, b)
def test():
    assert dict_precedes('ideologically', 'idiom')
    assert dict_precedes('Ideologically', 'idiom')
    assert dict_precedes('ideologically', 'Idiom')
    assert dict_precedes('Had-I-Wist', 'Haematite')
    assert dict_precedes('Hades', 'Had-I-Wist')
    assert dict_precedes('Hackthorn', 'Haddock')
    assert dict_precedes('Haddock', 'Hades')
    assert dict_precedes('Hackthorn', 'Hades')
test()

def main(port):
    root = twisted.web.resource.Resource()
    imagebase = '.'
    images = os.path.join(imagebase, 'newenglishdict05murrmiss_jpg')
    assert os.path.exists(images)
    thumbnails = os.path.join(imagebase, 'thumbnails')
    guidewordlog = os.path.join(imagebase, 'guidewords')
    book = Book(images, thumbnails, file(guidewordlog, 'r+'))

    root.putChild('', twisted.web.static.Data(indexpage, 'text/html'))
    root.putChild('images', book.images_resource())
    root.putChild('thumbnails', book.thumbnails_resource())
    root.putChild('page', book.viewer_resource())

    twisted.internet.reactor.listenTCP(port,
        twisted.web.server.Site(root))
    print "listening", port
    twisted.internet.reactor.run()
if __name__ == '__main__':
    main(int(sys.argv[1]))

Mon, 14 Nov 2005

> So for $3150 plus a case and some buttons, you could have all of
> Wikipedia at your fingertips, plus 2000 Project Gutenberg ebooks, plus
> 12000 pages scanned from paper books.

Beating a classical technology equivalent, The Penguin "Complete"  
Collection[0], by about 77 linear feet and $4850...

-Dave

[0]  
http://www.amazon.com/exec/obidos/tg/detail/-/0147503078/ 
qid=1131998118/sr=2-1/ref=pd_bbs_b_2_1/103-8346050-0715004? 
v=glance&s=books
(1082 titles, "nearly half a million pages")

E-Ink is selling $3000 "technology evaluation kits" for its "digital
paper" --- 400MHz XScale Gumstix board pre-installed with Linux (with
Bluetooth, USB, MMC), 800x600 6" E-Ink display, batteries, power
supply.  No case or human input device.  US$3000; fax them an order
form.

<http://www.eink.com/kits/>
<http://www.linuxdevices.com/news/NS9257262400.html>

This is nice, because for a little over $100 ($120 or so) you can get
2GB MMC cards <http://www.pricewatch.com/prc.aspx?i=226&a=396866>.  At
present, Project Gutenberg's FTP site lists 25004 text files totaling
17.45 GB (by my crude count) <insert reference here>.  You could
compress almost 5GB of text files onto this 2GB MMC card; in the case
of Project Gutenberg, that would be around 7000 text files, probably
7000 books --- a substantial fraction of the English literary canon.

>From the Internet Archive's Million Books scanning project, I
downloaded a scan of a dictionary (Walter Farquhar Hook's Church
Dictionary, 1842).  It weighs in at 24.5 MB for high-res scans of 558
pages, compressed with DjVu.  2GB, then, is space for almost 49000
scanned pages.

In December 2004, English Wikipedia's TomeRaider file was 537MB
compressed
<http://en.wikipedia.org/wiki/Wikipedia:TomeRaider_database>, but now
it appears to be close to 900MB
<http://download.wikimedia.org/tomeraider/current_tr3/> including
images and text but excluding page history and sounds.

So for $3150 plus a case and some buttons, you could have all of
Wikipedia at your fingertips, plus 2000 Project Gutenberg ebooks, plus
12000 pages scanned from paper books.

In itself, this is interesting, but it gets better.  The Librie
EBR-1000EP cost US$390 and contained essentially the same hardware,
minus the large flash.  126mmx190mmx13mm, 190g (300g with case and
batteries).
<http://66.150.196.37/forums/showthread.php?t=2403&goto=nextnewest>.
It was greeted with enthusiasm in the press
<http://technology.guardian.co.uk/online/story/0,3605,1197495,00.html>.
It wasn't marketed in the US, and the available books cost US$12 and
disappeared after 60 days.  I heard a rumor that they eventually
opened it up to more formats:
<http://www.engadget.com/entry/1234000790024272/>.  There are 1389
members of the Librie Yahoo mailing list
<http://groups.yahoo.com/group/librie/> with about 6 messages per day;
people are working on <http://www.sven.de/librie> reverse-engineering
the file-format.

So imagine $500, $250, or $100 for a similar device.

(The Gumstix computer running this thing costs on the order of$99 and
dissipates "under 250 mA", i.e. close to 1.25W at full speed.)

Other possible applications for such machines (with their 1000-ms screen
update times, during which time the screen flashes all black) include: 
- displaying maps
- puzzle games (that is, games where you spend a lot of time thinking)
- crossword puzzles
- viewing concordances
- hypertext navigation

(The standard Slashdot comment here is, of course, "Imagine a Beowulf
cluster of these things!"  A 200W Gumstix Beowulf might include 200
Gumstix boards, cost $30 000, weigh as much as a light laptop, execute
80 billion instructions per second, and contain 12GB of RAM.  That's
roughly the same MIPS per dollar as an x86 desktop PC (times or
divided by two?), but 10x better MIPS per watt.  So there might be
some potential here.)

Thu, 10 Nov 2005

An efficient way to eliminate possible (prime) factors in your head is
to add and subtract multiples of the possible factor to the number to
be factored so as to make a multiple of 10, then divide the number by
10.  The resulting number will be divisible by the possible factor
(unless the possible factor contains 2 or 5) iff the original number
was.

How do you know what to multiply your possible factor by?

If your possible factor ends with 1 or 9, then:
- if your putative composite ends in 1 or 9, multiply by 1;
- if it ends in 2 or 8, multiply by 2;
- if it ends in 3 or 7, multiply by 3;
- if it ends in 4 or 6, multiply by 4;
- if it ends in 5, multiply by 5.

>>> lastdigits = Numeric.multiply.outer([1, 3, 7, 9], Numeric.arange(10)) % 10
>>> complements = 10 - lastdigits
>>> Numeric.minimum(lastdigits, complements)
array([[0, 1, 2, 3, 4, 5, 4, 3, 2, 1],
       [0, 3, 4, 1, 2, 5, 2, 1, 4, 3],
       [0, 3, 4, 1, 2, 5, 2, 1, 4, 3],
       [0, 1, 2, 3, 4, 5, 4, 3, 2, 1]])

So if your possible factor ends with 3 or 7, then:
- if your putative composite ends in 1 or 9, multiply by 3;
- if composite ends in 2 or 8, multiply by 4.
- if composite ends in 3 or 7, multiply by 1;
- if composite ends in 4 or 6, multiply by 2;
- if composite ends in 5, multiply by 5.

So you really only need to remember the sequence 3 4 1 2 5 for 3 or 7.

So to factor 3829:
- it's not divisible by 2 or 5 obviously;
- the digit sum is 4, so it's not divisible by 3;
- adding 21 (3*7) gives 3850, which is 10*385; subtracting 35 (5*7)
  gives 350, which is 10*5*7.  So it's divisible by 7, so you can do
  the long division to figure out what's left: 547.
  - No need to test 547 for divisibility by 2, 3, or 5.
  - It's not divisible by 7 because it's 54*10+7, and 54 isn't divisible by 7.
  - 5+7-4 is 8, which is not congruent to 11, so it's not divisible by 11.
  - 547+13 is 560, or 10*56, and 56 isn't divisible by 13.
  - 547-17 is 530, or 10*53, and 53 isn't divisible by 17.  (53+17=60.)
  - 19*3 = 57; 547-57 = 490, which isn't divisible by 19.
  - 547+23 = 570, which isn't divisible by 23.
  - 29*3 = 87; 547-87 = 460, which isn't divisible by 29.
  - 29*29 > 547.  So 547 is prime.

A fellow named Ping taught me this, but I don't remember his name.

Mon, 07 Nov 2005

Observations of BitTorrent behavior, with an oversimplified model
-----------------------------------------------------------------

(This varies a lot from torrent to torrent.)

On average, the number of seeders on a BitTorrent torrent is around
10% of the number of leechers, a number that gradually decreases until
the torrent dies.  This is not true for all torrents, but it's true of
a substantial number of them.

If you're using such a torrent to publish a file (such as a home video
or collection of multi-megapixel photographs) that you wish to remain
continuously available, you must operate your own permanent seed.
However, you'd like most of the file data in your torrent to travel
from one peer to another, rather than from your own seed to the peers.

To simplify, I assume that there are no incomplete downloads, and each
leecher becomes a seeder for some period of time after they finish ---
about 10% of the time they had been leeching.

As time goes on, the number of leechers in the torrent approaches the
time to download a complete copy of the file, multiplied by the
frequency of new requests; P, the number of total peers aside from the
permanent seed, is the number of leechers multiplied by a constant
around 1.1.

Whenever P is 1 or 0, all of the data transmitted in the system must
comes from your permanent seed, and BitTorrent degrades to a download
protocol similar to FTP or HTTP, but with better protection against
corruption of data in transit.

Whenever P > 1, non-permanent peers can exchange data with one another
(since they're online at the same time), and the load on the permanent
seed is lessened.  Indeed, if the number of other peers remains above
1 permanently, the permanent seed need never send out data blocks
(under the simplifying assumptions above.)

Where BitTorrent works well
---------------------------

Mostly BitTorrent has been used successfully to date with P in the
hundreds, sometimes with a permanent seed and sometimes without.  But
since P is proportional to the frequency of new requests and the file
size, but inversely proportional to the bandwidth to these peers,
BitTorrent has not been very successful at distributing small files or
those with only a few requestors at any given time.

Making BitTorrent work better through aggregation
-------------------------------------------------

One answer is multi-file torrents, or ISO torrents, in which a torrent
contains hundreds of megabytes of data belonging to many different
files.  This attacks the problem on two fronts: first, it increases
the number of requests for a particular torrent by aggregating demand
for many individual files into a single torrent, and second, it
increases the amount of time required to complete the download.

This may have problems when applied to very large files, however.

BitTorrent works much better than previous peer-to-peer file-sharing
systems for several reasons, of which the best known is that the
software embodies a social norm of reciprocation.  The most effective
way to get good service from a torrent is to send data to other
participants so that they will want to send data to you.  Therefore,
modifying your own copy of the software to improve your own service at
the cost of others is quite difficult.

(Most previous systems also suffered from not having "swarming
downloads" and from attempting to provide search and confidentiality
or deniability facilities.)

Suppose, though, you're downloading a 300MB OurMedia video, and you
find that it's packaged into a 10GB UDF filesystem, which will take up
US$10 of disk space when it finishes downloading, cost you about US$3
of bandwidth at US DSL rates to download, and a similar amount of
bandwidth to upload to others, and worst of all, it will take about a
day; while downloading the video alone would cost you only about
US$0.15 upload and US$0.15 to upload and take less than an hour.  You
might modify your BitTorrent client's piece-selection algorithm to
preferentially request the parts of the UDF filesystem that interests
you, and preferentially talk to peers who have it.

If everyone does this, the result would be rather as if they're
divided among 33 different torrents; each peer stays online for less
time, and other peers it will want to talk to come online less often.

Making aggregation more effective
---------------------------------

Suppose that instead of aggregating 33 OurMedia videos into a single
UDF filesystem, the parts of which can be sensibly requested
separately, we use an M-of-N secret-sharing scheme to encode the 33
videos into a file such that you must download the whole file, or
nearly the whole file, to recover any of the original videos from it.

Now we have aligned everyone's incentives properly, without making any
changes to the BitTorrent protocol or software --- simply by adding an
archiving/unarchiving step outside the purview of BitTorrent itself.
Nobody can gain any use from this torrent without downloading the
whole thing, or nearly so.

The drawbacks of aggregation
----------------------------

The aggregation approach is costly.  Even if everyone participating in
the earlier-mentioned 10GB torrent is interested in only one video,
they have to pay an aggregation tax of 9.7GB --- uploaded, downloaded,
and stored --- merely to provide incentives for others to cooperate
with them.  Economically, this is nearly a deadweight loss.  It can be
diminished slightly by aggregating related files instead of randomly
selected ones.

As a consequence, ISPs will wish their customers would download
smaller torrents, and pay-for-download services (PayPal me $2 and I'll
give you this 300MB file) will have a total cost advantage over
peer-to-peer downloading.

Other approaches to supporting small and unpopular files with BitTorrent
------------------------------------------------------------------------

Trickling
---------

When P = 1, the permanent seed controls the amount of time the client
is online by how fast it sends the requested file.  Consequently it
can lengthen the time to download the file to increase the likelihood
that P will increase as another peer comes online.  However, if there
are only a few ephemeral peers and they are talking to one another
quickly, each time the permanent seed sends the last block of the
file, the other peers will quickly communicate it among themselves and
soon go offline.  Still, this is a way to triple or quadruple your
effective bandwidth, which is worth something.

CacheLogic
----------

CacheLogic is a company that sells peer-to-peer caching devices to
ISPs.  The idea is that the ISP saves a copy of whatever BitTorrent
data passes through it, and serves it to you from their local copy
whenever you request it.  (I'm not clear on whether it merely
participates as a peer in the torrent or actually hijacks your
attempted connections to other peers.)

Payment
-------

MojoNation, and later MNet, had the idea that you would exchange some
sort of currency for services such as file downloads.  This allows the
reciprocity enforcement mechanism to extend beyond pairwise
relationships into group relationships, and to provide a measure of
otherwise incommensurable goods.  This would provide an incentive for
people to download files they didn't themselves want, so that they
could earn credit from other people who did want them, and then use
that credit to download things they did want.

I have written about this a bit on the CommerceNet zLab Wiki.

Sat, 05 Nov 2005

#!/usr/bin/python
"""Sort paragraphs in a text file according to a key defined by some
regular expression.  I built this so I could sort the books I wanted
to get according to which aisle they were shelved in.

"""

import re, sys

def paragraphs(afile):
    "Iterate over the paragraphs in a text file."
    current_paragraph = ''
    for line in afile:
        if line.strip() == '':
            if current_paragraph != '': yield current_paragraph
            current_paragraph = ''
        else:
            current_paragraph += line
    if current_paragraph != '': yield current_paragraph

def get_aisle(regexp):
    def _(para):
        mo = re.search(regexp, para)
        if mo: return mo.group(1)
    return _

def doit(infile, regexp):
    paras = list(paragraphs(infile))
    paras.sort(key=get_aisle(regexp))
    print '\n'.join(paras)

if __name__ == '__main__':
    doit(file(sys.argv[1]), len(sys.argv) > 2 and sys.argv[2] or r'isle (\d+)')

# ./aislesort.py ~/sdc1/kragen-pim/books 'by \w+ (\w+)'
# ./aislesort.py ~/sdc1/kragen-pim/books '_([\w ]+)_'
# ./aislesort.py ~/sdc1/kragen-pim/books '_(?:The )?([\w ]+)_'
# ./aislesort.py ~/sdc1/kragen-pim/books 'Recommended by (\w+)'
# ./aislesort.py ~/sdc1/kragen-pim/books '(\d\d\d\d-\d\d-\d\d)'
# ./aislesort.py ~/sdc1/kragen-pim/books '(\d+) pages'
# ./aislesort.py ~/sdc1/kragen-pim/books "(?i)(harpercollins|o'reilly|princeton|mcgraw-hill|sheffield hallam university press)"
# ./aislesort.py ~/sdc1/kragen-pim/books "ISBN\s+([-\d]+)"

Thu, 03 Nov 2005

Universal cellular automata
---------------------------

There's a rumor from Stephen Wolfram that certain one-dimensional
two-state three-cell-neighborhood cellular automata are
Turing-complete, in particular rule 110.  Here's a simple Python
script for running such 1-D CAs:

    #!/usr/bin/python
    import sys

    def display(line):
        sys.stdout.write(''.join([' #'[x] for x in line]) + '\n')

    def run(n=110, ng=1000, nc=400, display=display):
        cells = [0] * nc
        cells[nc/2] = 1
        for ii in range(ng):
            display(cells)
            newcells = [0] * len(cells)
            for jj in range(1, len(newcells)-1):
                index = 4 * cells[jj-1] + 2 * cells[jj] + cells[jj+1]
                newcells[jj] = n & (1 << index) and 1 or 0
            cells = newcells

    if __name__ == '__main__':
        run()

Supposing this rumor is correct, and rule 110 is indeed a universal
computer (which seems plausible even from the output from this
program), it's quite simple.  

Computing rule 110
------------------

The truth table, is 01101110, is as follows in a Karnaugh map:
      
          right (1's)      
          0    1           
                          
       center center  (2's)
        0  1  1  0   
                     
     1  0  1  0  1   
left                
(4's)0  0  1  1  1   


Hope I got that diagram right.  Here are some sum-of-products forms it
suggests:

~R & C  |  ~L & C  |  R & ~C     ( equivalently (R ^ C) | ~L & C )

~R & C  |  R & ~L  |  R & ~C     ( equivalently (R ^ C) | ~L & R )

I used the program at
http://lists.canonical.org/pipermail/kragen-hacks/2003-February/000365.html
(which takes the truth table backwards from the Python program and how
I've listed it above) and it finds a number of five-NAND-gate
solutions, the first of which is H = -(-(-(C*A)*B)*-(-(B*A)*C)):

          A: 00001111
          B: 00110011
          C: 01010101
   D=~(C*A): 11111010  (delay 1)
   E=~(D*B): 11001101  (delay 2)
   F=~(B*A): 11111100  (delay 1)
   G=~(F*C): 10101011  (delay 2)
   H=~(G*E): 01110110  (delay 3) (correct)

None of the five-NAND-gate solutions are shallower (i.e. faster) than
that three-propagation-delay item.

(I'm a little dubious about this because, when the truth table was
reversed, it needed 7 gates to find solutions, e.g. E = -(B*B); G =
-(E*-(A*A)); J = -(-(-(G*C)*G)*-(E*C)); so I wonder if the program is
buggy.)

A suggested embodiment: barely subcritical lasers
-------------------------------------------------

You could imagine that you might be able to find a physical system
that exhibits this precise behavior on streams of incoming encoded
bits.  For example, some funny-shaped laser.  It merely needs four
internal states, or three different inputs with slightly different
delay lengths.

Suppose you have a 10-meter spool of fiber-optic cable carrying
1-picosecond laser pulses, coming out of a smallish laser that's kept
very close to the population inversion threshold.  (It can't be too
much bigger than 0.3mm or it gets hard to generate such quick pulses.)
The spool acts as a delay-line memory containing about 33 kilobits of
information.  It feeds back into the laser in some unspecified way,
determining with its small energy input whether the laser is above or
below the population inversion threshold, in a way carefully designed
to produce the above truth-table on the last three bits fed into the
laser.  (It might be smart to encode bits in the timing of the pulse
rather than in its amplitude.  For example, as in the Internet-0
protocol, each bit-time could be divided into a zero half and a one
half, and a pulse could be present in either one half or the other.)

(This hypothetical computational machine takes advantage of an
often-lamented feature of current optical technology: communication
and storage are very easy, while logic and switching are very
difficult.)

So this hypothetical optical system is transforming bits at 1THz,
although it is in effect simulating 33000 machines running in parallel
at 30MHz each.

Femtosecond laser pulses have been demonstrated, so pumping up the
speed to a petahertz (and consequently the storage to 33 megabits)
isn't implausible.  A modern high-performance CPU has on the order of
100 million gates which change state on the order of 4 billion times a
second, so if it operated bit-serially, it would have to change bit
states at around 400 petahertz.  It's plausible that our hypothetical
femtosecond laser could perform useful computations either faster or
slower than a standard CPU.

I'm pretty sure someone else has done work using the chaotic behavior
of lasers close to the critical threshold for computation, but I can't
remember who it is.

Bucket brigades or pipelines provide linear scaling
---------------------------------------------------

If instead of routing the laser's output back to itself after some
delay, we routed it to another such laser, whose output was routed
back to the first, the bits would run through two generations on each
round trip.  This can be repeated arbitrarily, the limitations being
how many laser assemblies you can afford (with money and reliability)
and how close together you can put them.

Input and output
----------------

I/O is an interesting concern.  If the laser generates enough power
that we can spy on its memory state as it goes by, we can run it
through some other finite-state machine to look for certain patterns
that indicate "data to be output", and similarly "data to be input".
I don't have any experience programming useful computation on such a
machine at present, so I don't know what form this would take.

In an N-bucket bucket-brigade configuration, presumably the
opportunity for I/O would only happen in one place in the brigade, so
only one generation out of every N would have the opportunity to
participate in I/O.

Nuclear reactions as CA computational elements
----------------------------------------------

Nuclear reactions are another possible computational element that
could be used in this way.  If you can build some substance that has
four distinct states among its nuclei and that can be persuaded to,
for example, fission and emit more particles (ideally alpha particles
or other charged particles) when bombarded with incoming particles in
the correct sequence, you could store your machine state in a series
of pulses in a particle beam.

I don't know how fast nuclear reactions happen, although I've seen
abstracts (such as http://www.iop.org/EJ/abstract/0954-3899/23/10/024
"Near- and sub-barrier fission fragment anisotropies and the failure
of the statistical theory of fission decay rates", J P Lestone et al
1997 J. Phys. G: Nucl. Part. Phys. 23 1349-1357) suggesting they're on
the order of 10^-20 seconds; chemical reactions are on the order of
femtoseconds (10^-15 seconds) already, and I assume some nuclear
reactions are even faster, since they involve the same mass moving
much less distance.  If a single nuclear-powered cellular-automaton
engine could process 10^20 cells per second, it would process 10 000
times as many bits per second as a conventional semiconductor-based
machine; and there's nothing preventing 100 or 1000 of them from being
placed in a bucket brigade.  However, I suspect that each element
would have to be no more than 10^-20 seconds in diameter to avoid
fuzzing out the signal, which is around 3 picometers --- smaller than
most atomic diameters.

Current X-ray lithography techniques produce features of 45 nanometers
in size; that's only 1.5 * 10^-16 seconds across at the speed of
light, and there are existing lasers that produce pulses around 10^-15
to 10^-14 seconds.  So it might be practically difficult to achieve
these theoretical speed advantages of computing with nuclear
reactions.

Simulating 2-D CAs with 1-D CAs
-------------------------------

If we scan across a current state of a 2-D CA in a raster fashion,
feeding each cell state into such a 1-D CA as an additional input to
its state machine, we can perhaps program the 1-D CA to store the
current state of a couple of rows of the 2-D CA, and produce new
states of the 2-D CA at intervals, much as the Cytocomputer did.

Related work
------------

A lot of this posting recapitulates my "really simple computers"
kragen-tol posting from February 2003:
http://lists.canonical.org/pipermail/kragen-tol/2003-February/000738.html

The Cytocomputer, by the Environmental Research Institute of Michigan,
used a technique related to this for machine vision with
two-dimensional cellular automata, using a pipeline of ASICs, each of
which contained two rows of the image plus three cells.  Each clock
cycle, each stage would output the 8-bit state of the cell currently
being computed, discard the upper-left-hand corner of that cell's
9-cell neighborhood, and read the lower-right-hand corner of the next
cell's neighborhood from the previous stage.  See (expired) US Patents
4,665,554 and 4,665,551, by Stanley R. Sternberg, and related patents
for details.

There's an unpublished paper showing that an arbitrary number of delay
elements and an OR gate comprise a universal computer.  (In my
previous article, I wrote: Miles Murdocca, at Bell Labs, in an
unpublished paper, between 1983 and 1993, according to an article by
J. Storrs Hall in "Extropy", #10, vol. 4, no. 2, Winter/Spring 1993,
entitled, "Nanocomputers: 21st century hypercomputing."  I have a very
vague memory of corresponding with someone (perhaps Dr. Murdocca?)
afterwards on this, but unfortunately I cannot find any such
correspondence in my archives.  Probably bits and pieces of this work
are subconsciously embedded in the above.)