Mon, 30 Jun 2008

I use a sort of log-structured filesystem for my notebooks.  I fill
the notebooks in chronological order (more or less) from the second
page to the last page.  (The first page is left blank at first.)
Everything is under some heading; the current heading is repeated at
the top of every page, with the date, but sometimes there are several
headings on a single page.  The headings are underlined so they're
easy to see looking at the page.

So I can find things by paging through the recent pages and looking at
the headings.  When that gets to be too much, I append a new "table of
previous contents" section, under a heading just like everything else;
it lists all the headings, with dates, since the last "table of
previous contents".  The first page contains a list of tables of
previous contents, with their dates, so that I can find them
relatively quickly.  This allows me to find my notes more quickly by
reading through the few pages that are full of tables of previous
contents, rather than leafing through all the pages in the book
looking for headings.

If I were a disk, which I'm not, this would be a reasonably efficient
scheme for writes: regardless of how much stuff I have to write, I
could append it all in a single write to the end of the
currently-written data, possibly including a new table of previous
contents, then update the "superblock" on the first page with a
pointer to the new table.  So writing any amount of data less than a
notebookfull requires a seek to the end of the previous ToC, possibly
a read of data following it, a write of the new data, and possibly a
second seek and a second write to the superblock.  Two seeks.  Finding
something in a notebook with three ToCs requires at most four seeks:
one to each ToC, then another one to the data; if it's not listed in
any ToC, you can sequentially scan for it after the last ToC.
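
The seek counts above can be checked with a toy model.  This is only a sketch: the `Notebook` class and its method names are made up for illustration, every note is assumed to fill exactly one page, and reading the superblock itself is not counted, to match the arithmetic in the text.

```python
class Notebook:
    """Toy model: page 0 is the 'superblock', a list of ToC page numbers."""

    def __init__(self):
        self.pages = [[]]

    def write(self, heading):
        # Append one note (assumed to occupy exactly one page).
        self.pages.append(("note", heading))

    def write_toc(self):
        # Index every heading written since the previous ToC,
        # then record the new ToC's page number in the superblock.
        tocs = self.pages[0]
        start = tocs[-1] + 1 if tocs else 1
        entries = {page[1]: n
                   for n, page in enumerate(self.pages[start:], start)}
        self.pages.append(("toc", entries))
        tocs.append(len(self.pages) - 1)

    def find(self, heading):
        """Return (page, seeks); page is None if the heading is absent."""
        seeks = 0
        tocs = self.pages[0]
        for toc_page in tocs:
            seeks += 1                       # one seek per ToC consulted
            entries = self.pages[toc_page][1]
            if heading in entries:
                return entries[heading], seeks + 1   # plus one to the data
        # Not listed anywhere: scan sequentially after the last ToC.
        start = tocs[-1] + 1 if tocs else 1
        for n in range(start, len(self.pages)):
            if self.pages[n] == ("note", heading):
                return n, seeks
        return None, seeks


nb = Notebook()
for batch in (["a", "b"], ["c", "d"], ["e", "f"]):
    for heading in batch:
        nb.write(heading)
    nb.write_toc()
nb.write("g")                 # not yet listed in any ToC
print(nb.find("f"))           # (8, 4): three ToC seeks plus one to the data
print(nb.find("g"))           # (10, 3): found by scanning after the last ToC
```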

With this scheme, there's a tradeoff (for either humans or for disks)
between the amount of sequential scanning you may have to do (due to
still-unrubricated items) and the number of ToCs you may have to seek
to and read.

Beatrice pointed out the other day that it would be easier for a human
to write the notes sequentially from the beginning of the book, while
writing the ToC entries sequentially from the end of the book.  This
way, all the ToC entries are in a single sequential chunk, the
tradeoff between maximum sequential scan length and ToC fragmentation
is eliminated, and writing still requires only two seeks.
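
A rough sketch of that two-ended layout, under the same simplifying assumption of one note per page; `TwoEndedNotebook` and `toc_entries_per_page` are hypothetical names, not anything from a real filesystem.

```python
class TwoEndedNotebook:
    """Notes fill pages from the front; ToC entries fill pages from the back."""

    def __init__(self, n_pages, toc_entries_per_page):
        self.n_pages = n_pages
        self.per_page = toc_entries_per_page
        self.notes = []    # notes[i] is on page i, counting from the front
        self.toc = []      # (heading, page) pairs, filling pages from the back

    def toc_pages(self, n_entries):
        return -(-n_entries // self.per_page)      # ceiling division

    def write(self, heading):
        # The book is full when the two regions would collide.
        needed = len(self.notes) + 1 + self.toc_pages(len(self.toc) + 1)
        if needed > self.n_pages:
            raise ValueError("notebook full: notes and ToC have met")
        self.toc.append((heading, len(self.notes)))   # seek 1: back of the book
        self.notes.append(heading)                    # seek 2: front of the book

    def find(self, heading):
        # The whole ToC is one contiguous run, so this is a single scan.
        for listed, page in self.toc:
            if listed == heading:
                return page
        return None


nb = TwoEndedNotebook(n_pages=4, toc_entries_per_page=2)
nb.write("a")
nb.write("b")
print(nb.find("b"))   # 1
```

A third `write` on this four-page book would raise `ValueError`, since three note pages plus two ToC pages no longer fit.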

Of course she is correct, and this might be a reasonable strategy for
log-structured filesystems too, although there are usually more levels
of indirection: from superblock, through various levels of inodes and
directories, to the actual file extents on disk.  You could probably
do a reasonable job by putting a B-tree of pathnames at a fixed
location of the disk, and putting the inodes and data extents
contiguously somewhere else.  `/var/cache/locate/locatedb` is a
reasonable approximation of the contents of this B-tree; on my current
laptop, it's 5.3MB, indexing 95GB of files using 596 662 inodes
(i.e. 596 662 files, although `sudo locate / | wc -l` only finds
494 488 files).
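
To put those numbers in proportion, here is the arithmetic as a small sketch.  It assumes "MB" and "GB" mean decimal units, which the text doesn't say; and the per-path figure is probably as small as it is because locate databases front-compress their paths (as far as I know), so a B-tree with full keys and pointers would likely be somewhat bigger.

```python
DB_BYTES = 5.3e6          # "5.3MB", read here as decimal megabytes (assumption)
INODES = 596_662
PATHS = 494_488           # what `locate /` reported
DATA_BYTES = 95e9         # "95GB", likewise read as decimal (assumption)

bytes_per_path = DB_BYTES / PATHS
bytes_per_inode = DB_BYTES / INODES
index_overhead = DB_BYTES / DATA_BYTES

print(f"{bytes_per_path:.1f} bytes per path")        # 10.7 bytes per path
print(f"{bytes_per_inode:.1f} bytes per inode")      # 8.9 bytes per inode
print(f"{index_overhead:.4%} of the indexed data")   # 0.0056% of the indexed data
```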

Repacking a 5-20MB B-tree when it got too large and loose would take a
significant fraction of a second on a modern disk, but on my laptop
would take perhaps 10-20 seconds, due to the slowness of on-CPU disk
encryption.  So it might be better to defragment the tree incrementally.

Sat, 28 Jun 2008

I've been staying in Oakland, taking care of my cousin's house and
garden as he and his partner Becca visit family in New Mexico.

The other night, I ordered a carnitas burrito from a taco truck in a
parking lot at midnight, just after some Mexican-American teenagers
who talked to each other in English but ordered in Spanish.  They
breakdanced in the parking lot among the other people there: middle-aged
men with their toddler sons, young pregnant women with their young
husbands or boyfriends.  I felt very blessed to be there, under the flickery
yellow lights with the smell of many kinds of grilled meat wafting out
from the taco truck: lengua, carnitas, cabeza.  The burrito was
delicious; I ate it as I walked home.

***

Today, as I watered the garden, two young women drove up in a van full
of picnic supplies.  They turned out to be next-door neighbors I
hadn't met yet.  I was watering the green beans, occasionally munching
a succulent, sweet green pod.  When I initially said hello, they
didn't respond; I thought maybe they didn't speak English, so I told
them in Spanish how good the beans were and offered them some.  One of
them answered in English and accepted a pod, but didn't like it very
much.

***

Yesterday morning, I left their house at 7:30 so I wouldn't be late
for a meeting at HP Labs in Palo Alto at 10:00.  I stopped at the 16th
and Mission stop where I'd left Becca's bike the night before, with
both wheels and the frame locked to a parking meter, in between all
the other bicycles.  When I arrived, it was the only bike left; the
others had all left the night before.  Nothing was missing from it,
not even the pump and polyethylene water bottle.  But I missed my
connection at the 16th and Mission station, and I arrived at the
meeting at 10:45.

Unbeknownst to me, I had ruptured the rear inner tube riding it up to
HP Labs, and I had neglected to carry a tube repair kit with me ---
although there was one on the living room coffee table and one on the
shelf in the bedroom.  My friend Rohit gave me a ride to downtown with
the bike, where I bought tire levers and a patch kit, repaired the
tube, and broke the pump.

***

The other day, I wanted to make caprese sandwiches to share with my
friend Josh.  I picked fresh basil from the back yard and cut up some
tomatoes and mozzarella, but then discovered no bread.  But I had made
pancakes for breakfast with my friend Linley that morning.  So Josh
and I ended up having caprese sandwiches on cold pancakes made with
vanilla soy milk, on a rooftop plaza in the new San Francisco Public
Library building, accompanied by a garden salad from Ben and Becca's
garden, with nasturtium flowers, oxalis, arugula, purslane, and I
think a little mint, on top of some store-bought lettuce.

***

One day, on the way "home", I stopped at the 16th and Mission station
to buy a phone card.  The $5 La Leyenda card I bought has provided
about 45 minutes of talk time to Beatrice in Argentina over the course
of more than a week.  I called her immediately before leaving the
phone-card store and talked for 15 of those minutes, because we hadn't
heard each other's voices in days.

This is the kind of thing I wish everybody could write for themselves.
It automated a simple, repetitive task that Beatrice was spending a lot
of time on.  I wrote most of it in a few minutes, and then she finished
it.

    #!/usr/bin/perl -w
    use strict;
    # script to help beatrice with her web pages
    # comments for perl newbies

    # ARGLEBARGLE marks the end of $stylelinks
    # "my" creates a new variable
    my $stylelinks = <<ARGLEBARGLE;
         <link href="../css/global.css" rel="stylesheet" type="text/css">
         <link rel="shortcut icon" href="../../paisley.ico">
    ARGLEBARGLE

    # similarly with SIMILARLY
    my $sidebar = <<SIMILARLY;
    <ul class="first">
    	<li><a href="../animals/index.html"><h2 localizable="true">Animals</h2></a></li>
    	<li><a href="../architecture/index.html"><h2 localizable="true">Architecture</h2></a></li>
    	<li><a href="../light/index.html"><h2 localizable="true">Light</h2></a></li>
    	<li><a href="../macros/index.html"><h2 localizable="true">Macros</h2></a></li>
    	<li><a href="../performances/index.html"><h2 localizable="true">Performances</h2></a></li>
    	<li><a href="../plants/index.html"><h2 localizable="true">Plants</h2></a></li>
    	<li><a href="../../index.html"><h2 localizable="true">Home</h2></a></li>
    SIMILARLY

    sub mogrify_file {
      my ($input_filename, $output_file) = @_;
      # creates a variable $file, opens the file for input (<), and sticks
      # the open file in $file
      open my $file, "<", $input_filename or die "Can't open $input_filename: $!";

      # <$file> reads a line from the open file and sticks it in $_
      while (<$file>) {
        if (/<h2/) {              # /foo/ looks for "foo" in $_, returns true if found
          print $output_file $sidebar;
        } elsif (/rel="stylesheet"/) {
          print $output_file $stylelinks;
        } elsif (/CC-Attribution/) {
          print $output_file ' <p localizable="true"><a href="http://creativecommons.org/licenses/by-sa/3.0/">CC-Attribution 3.0</a> Beatrice Murch</p>';
        } else {
          print $output_file $_;
        }
      }
    }

    sub dirname {
      my ($filename) = @_;
      # chop off slashes followed by zero or more (*) nonslashes ([^/]) at
      # the end ($)
      $filename =~ s|/[^/]*$||;
      return $filename;
    }

    die unless (dirname "fixed/foo/bar/baz.html") eq "fixed/foo/bar";

    sub ensure_dir_exists {
      my ($filename) = @_;
      my $dirname = dirname($filename);
      # invoke shell command mkdir -p fixed/foo/bar
      system "mkdir", "-p", $dirname;
    }

    sub create_new_file_from {
      my ($filename) = @_;
      my $newfilename = "fixed/$filename";
      ensure_dir_exists($newfilename);
      open my $output, ">", $newfilename or die "Can't open $newfilename: $!";
      mogrify_file($filename, $output);
    }

    die "no files" unless @ARGV;
    for my $file (@ARGV) {
      create_new_file_from($file);
    }

Thu, 26 Jun 2008

(This is available in HTML at
<http://canonical.org/~kragen/html-succinct.html>.)

HTML is more succinct for things in its intended domain than
S-expressions, but still has better error-detection and correction
capabilities.

S-expression fans like to say that HTML, SGML, and XML are just
bastardized S-expression languages.  SGML partisans often respond that
matching end-tags allow for better error-reporting and correction.
But for typical HTML content --- mostly running text with a little bit
of interspersed markup --- S-expressions are not only harder to
correct, but also more verbose.

Consider this partial paragraph from the Ur-Scheme web page
<http://pobox.com/~kragen/sw/urscheme>:

    <li><b>Reasonably fast.</b> It <b>generates reasonably fast
    code</b> &mdash; when compiled with itself, it runs 2½ times
    faster (in user CPU time) than when it's compiled with <a
    href="http://www.call-with-current-continuation.org/"
    >Chicken</a>, 1½ times faster than when it's compiled with...</li>

Now, in traditional HTML, I could have left out the quotes around the
URL and the ending `</li>` tag.  Consider this S-expression version:

    (li (b "Reasonably fast.") " It " (b "generates reasonably fast
    code") " " mdash " when compiled with itself, it runs 2½ times
    faster (in user CPU time) than when it's compiled with "
    (a :href "http://www.call-with-current-continuation.org/"
    "Chicken") ", 1½ times faster than when it's compiled with...")

Most of the markup constructs take up more characters here:

    LI: '<li></li>'    (end tag could be omitted in traditional HTML)
        '(li "")'
    B:  '<b></b>'
        '(b "") '
    B:  '<b></b>'      (the second one)
        '" (b "") "'
    --- '&mdash;'
        '" mdash "'
    A:  '<a href=""></a>'  (quotes could traditionally be omitted)
        '" (a :href "" "") "'
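
The comparison above can also be tallied mechanically.  A small sketch that just measures the quoted strings from the list, trailing spaces included; the dictionary labels are only for display:

```python
# (HTML markup, S-expression markup) for each construct listed above
pairs = {
    "li":      ("<li></li>",       '(li "")'),
    "b (1st)": ("<b></b>",         '(b "") '),
    "b (2nd)": ("<b></b>",         '" (b "") "'),
    "mdash":   ("&mdash;",         '" mdash "'),
    "a":       ('<a href=""></a>', '" (a :href "" "") "'),
}

for name, (html, sexp) in pairs.items():
    diff = len(sexp) - len(html)
    print(f"{name:8} html={len(html):2}  sexp={len(sexp):2}  diff={diff:+d}")

html_total = sum(len(h) for h, _ in pairs.values())
sexp_total = sum(len(s) for _, s in pairs.values())
print(html_total, sexp_total)   # 45 52
```

Only the `li` comes out shorter as an S-expression (by two characters); the first `b` is a tie, and the other three are longer.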

If you look at this in a fixed-width font, you'll see that the number
of markup characters is detectably larger in the S-expression
serialization of the structure, with the exception of the first two.
I maintain that this is typical of the bulk of HTML, especially if you
weight it by how often people write it instead of how often it gets
sent to browsers.  You can come up with examples where that is not the
case:

    <html><head> <title>...</title>
                 <link rel="stylesheet" href="../../style.css" />
                 <meta http-equiv="Content-Type" content="..." />
                 <style type="text/css">...</style></head>...</html>

vs.

    (html (head (title "...") (link :rel "stylesheet" :href "../../style.css")
                (meta :http-equiv "Content-Type" :content "...")
                (style :type "text/css" "...")))

but those structure-heavy, text-light examples with long-winded tag
names are relatively rare for people to read and write.

Of course, the cost of terser syntax is often that errors are hard to
diagnose.  Ada's `end loop`, `end if`, `end record`, and so on mean
that if you leave out an `end` delimiter, the compiler will usually be
able to tell you which one you left out.  At the opposite end of the
spectrum, S-expression languages in which all the various kinds of
`end` are spelled as `)` can only tell you when they get to the end of
the program or to something that doesn't make sense in the current
context.

> This is not a phenomenon limited to end-delimiters.  In
> programming languages, there are many other examples of verbosity
> that helps to diagnose errors; for example, explicit type
> declarations, mandatory delimiter characters (in cases where the
> syntax would be no more ambiguous if they were removed from the
> grammar), sequences of single-line comments, and the conventional
> parenthesization of the arguments of fixed-arity functions ("ratio
> square sin x square sin y" is perfectly unambiguous, after all,
> and Forth, PostScript, Logo, and REBOL use more or less that
> syntax).

However, in the case of HTML, the terser syntax does not make errors
harder to diagnose; in fact, the HTML syntax permits better
error-detection and even error-correction, because all of the end-tags
are explicitly labeled.  (It differs from SGML in this regard; in
SGML, you can write `<li><b/Reasonably fast./ It ...</>` and eliminate
the redundant end-tags altogether.)
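
To make the difference concrete, here is a minimal sketch of the two kinds of checker.  Both are deliberately naive and are my own illustration, not how any real parser works: `check_tags` knows nothing about optional end tags, empty elements, or comments, and `check_parens` ignores parentheses inside strings.

```python
import re

def check_tags(text):
    """Diagnose the first end tag that doesn't match the innermost open tag."""
    stack = []
    for m in re.finditer(r"<(/?)([a-z]+)[^>]*>", text):
        closing, name = m.groups()
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        elif stack:
            return f"offset {m.start()}: found </{name}> but <{stack[-1]}> is still open"
        else:
            return f"offset {m.start()}: found </{name}> with nothing open"
    return f"end of input: unclosed {stack}" if stack else "ok"

def check_parens(text):
    """All a paren counter can say is how many are missing, and only at the end."""
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return f"offset {i}: unexpected )"
            depth -= 1
    return f"end of input: {depth} unclosed" if depth else "ok"

# The same mistake in both notations: the bold element is never closed.
html = "<li><b>Reasonably fast. It generates</li>"
sexp = '(li (b "Reasonably fast." " It generates")'
print(check_tags(html))
print(check_parens(sexp))
```

The tag checker can name the element that was left open and point at the place where the mismatch showed up; the paren counter only learns at the end of the input that one `)` is missing somewhere.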

Tue, 24 Jun 2008

I came here because of a death; my friend Eric died a couple of months
ago, and I came for his memorial service a week ago.  I've been spending
the time since then appreciating all the people who aren't dead yet.

Today was the day I had planned to fly back to Argentina, but
unfortunately a number of bureaucratic obstacles have lifted themselves
up in my path.  I could go back to Argentina, but I would probably have
to return to the US to deal with them.  So my departure is delayed until
July 4th.

I have had a wonderful time visiting friends and family here.  Every day
I see people I love whom I hadn't seen since last year, and it is
wonderful.  

But my time has been fairly full.  I've been very lucky in that friends
and family have lent me a house, a laptop, a bicycle, and a cell phone
while I'm here; without these, this level of activity would be pretty
difficult.

Some notable recent days:

Saturday: I went to Bolinas to get the Magic Bus; we think selling it in
San Francisco will be easier than selling it in Bolinas.  It certainly
won't be able to sell for the amount of money we've put into it (US$2200
of work late last year, US$800 or so when I rebuilt the engine, US$2000
to tow it across the country, etc. etc., plus the US$4500 that was its
price when we first got it.)  But maybe we can get some fraction of that
money back.

Monday: I made breakfast for one friend, lunch for another, visited the
California Department of State, went shopping in Chinatown, biked
several miles uphill, and traveled to Pleasanton on BART.

Last Tuesday: Said goodbye to my cousin who's lending me his house, met
a friend in Berkeley for breakfast, visited another friend to see her
lab and pick up the cell phone she was lending me, rode over to San
Mateo with the first friend, visited a company I used to work for, got a
phone card to call Argentina with, met a third friend for dinner in
Berkeley, went to a meeting of some friends in San Francisco to
incorporate a nonprofit (shaving with a dry razor as I walked down the
street to get there), picked up keys to a friend's apartment nearby, and
picked up groceries for breakfast the next morning as I walked back to
BART.

I've somehow managed to keep my expenses relatively reasonable while
doing this.  As of Saturday, my average since arriving in the US had
been US$14.14 per day, about 75% of which had been on public transit.  I
suspect it's gone up since then, largely because of the Magic Bus.
Already, though, that's the same as the rent on our apartment in Buenos
Aires.

Some time this week I will need to drive to Modesto and look through a
storage unit for bureaucratic reasons, which is generally an ordeal in
the summer.  I am hoping I can find an early-rising friend or two to
join me.

Aristotle:

Thank you for the explanation.  What you wrote makes sense.  The  
papers that I referenced [1] didn't analyze the consequences of data  
loss failures, only the mechanism (if any) that the filesystem code  
has for detecting and responding to failures.  Those papers seem to  
suggest that reiser3 is better than the others (but not perfect) at  
the goal of detecting more possible failures, handling them in a
fail-safe manner (thus trading off availability to gain correctness) and
handling a variety of failures in a consistent way.  However, if  
there is an error which isn't detected by the filesystem code (such  
as a silent mis-write), then I can see how the less redundant reiser3  
data structure is more brittle.

Thanks again.

> And for my own proclivities, reiserX goes too far toward the
> performance end of the scale.
...
> Hence my general dislike of reiserX.


Here are some personal proclivities of my own:

I don't like ext3.  It seems to be engineered "ad-hoc".  The recent
revelation that (a) it turns write-barriers off by default, (b)
nobody knows how much performance delta we're talking about, (c)  
nobody knows how much safety delta we're talking about, is just the  
latest detail to make me think that ext3 is engineered primarily by  
ad-hoc response to complaints (or by ad-hoc improvements).  I won't  
go into more detail.

I like reiser3 in general for the reasons that I listed above.   
However, "in general" is probably not the right way to choose a local  
filesystem.  Rather, the specific use case probably makes all the  
difference.  For Tahoe LAFS [2] storage servers, reiser3 is probably  
a good choice because it runs on Linux, is fast, and packs small  
files for better space efficiency.  Tahoe does not rely on local  
filesystems for data correctness or longevity, so the chance of data  
corruption or loss isn't that important of a criterion, but reiser3's  
tendency (mentioned above) to fail loudly and fail-stop is probably  
better operationally than the alternative of grinding along quietly  
while losing or corrupting data or suffering reduced performance.   
Also the fact that other large data-farming operations like the  
Internet Archive, Mozy, and EMC Centera [4] have used reiser3  
extensively gives me confidence.  Finally, the fact that reiser3 is  
old and does not get tweaked or improved is reassuring -- the worst  
failures we're likely to encounter are new bugs or new filesystem  
"improvements" that we didn't understand.

I like ZFS, and I'm happy using it on (Free Software, Open Source)  
Solaris.  My web server, http://zooko.com is running Nexenta [3]  
which uses ZFS by default.

I like BTRFS, and furthermore I predict that it will be a huge  
success in a few years because (as Andy Isaacson showed me), you can  
upgrade your ext3 filesystem in place to BTRFS, and even revert it  
again to the state that it was in before you upgraded it to BTRFS.  I  
think the main reason that ext3 is the de facto standard nowadays is  
because ext3 was so data-backwards-compatible with ext2, and since  
BTRFS is highly data-backwards-compatible with ext3 (as well as  
having many other great features, as well as being architected by  
Chris Mason who was responsible for much of the good stuff in  
reiser3, as well as being funded and supported by Oracle), then it is  
sure to be a winner.

Regards,

Zooko

[1] http://allmydata.org/trac/tahoe/wiki/Bibliography#LocalFilesystems
[2] http://allmydata.org
[3] http://nexenta.org
[4] http://lkml.org/lkml/2008/5/18/260

* zooko <zooko at zooko.com> [2008-06-24 02:30]:
> On Jun 23, 2008, at 1:50 PM, Aristotle Pagaltzis wrote:
>> The problem with the Reiser family FSs is that they are
>> inherently brittle. Now that they have been sufficiently
>> debugged, they no longer lose data often, but if you have even
>> a small unrepairable corruption, it is still more likely that
>> you’ll lose half your disk instead of just a few files, as is
>> the extX family’s failure mode.
>
> How do you know this? It doesn't seem to be implied by any of
> the papers that I referenced, but nor is it contradicted by
> them. I would like more data.

Purely from my own reasoning.

In extX, inodes, directories and the alloc bitmap are all
separate, and two of them are randomly accessible linear data
structures. Actually in some important ways even directories are.
There are a few crucial bits of metadata about these data
structures that, if destroyed, would preclude you from finding
them at all (eg. the superblock and such), but those are not
written to during normal operation. If you lose a directory,
the inodes are still there so you lose the tree structure but
none of the contents; if the bitmap is affected, as long as you
notice the inconsistency you lose nothing (so it’s really just
a cache); if you lose inodes, only the files described by the
affected inodes are lost. It’s simply impossible to do much
non-localised damage because the metadata layout has such low
entropy.

Of course that’s also a big reason why it’s impossible to make
extX fast for operations involving a lot of metadata.

In contrast, reiserX mediates all metadata through a Btree. If
you lose any subtree, the entire information about that subtree
becomes unreachable. You can use a carving-type tool and some
heuristics to try to find the metadata after the fact and restore
it as well as possible, but your chances are still mediocre. This
is how reiserX gets its phenomenal speed, of course – every bit
of metadata read from the disk helps avoid having to read more
metadata. Entropy is very high. That’s also the reason for its
sky-high CPU cycle consumption.

But it does mean that it is inherently brittle, because you need
all of the participating metadata to get at any piece of data,
whereas in extX a lot of the participating metadata only serves
as middle men providing indirection.

This is an information-theoretically rooted tradeoff. It is
mathematically impossible to make a filesystem both extremely
robust and extremely fast, because those properties lie at
opposite ends of the redundancy scale.

And for my own proclivities, reiserX goes too far toward the
performance end of the scale. At the same time I don’t think
extX is the be-all end-all on its part of the scale; I think
it is entirely possible to achieve robustness at least close
to that of extX without having to accept nearly as limited
performance.

Hence my general dislike of reiserX.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Mon, 23 Jun 2008

On Jun 23, 2008, at 1:50 PM, Aristotle Pagaltzis wrote:

> The problem with the Reiser family FSs is that they are
> inherently brittle. Now that they have been sufficiently
> debugged, they no longer lose data often, but if you have
> even a small unrepairable corruption, it is still more likely
> that you’ll lose half your disk instead of just a few files, as
> is the extX family’s failure mode.

How do you know this?  It doesn't seem to be implied by any of the  
papers that I referenced, but nor is it contradicted by them.  I  
would like more data.

Regards,

Zooko

* zooko <zooko at zooko.com> [2008-06-23 20:35]:
> I'm not sure what the use case we're talking about is,
> but for many use cases reiser3 is a good choice.

The problem with the Reiser family FSs is that they are
inherently brittle. Now that they have been sufficiently
debugged, they no longer lose data often, but if you have
even a small unrepairable corruption, it is still more likely
that you’ll lose half your disk instead of just a few files, as
is the extX family’s failure mode.

So even now I would not entrust either Reiser-family FS with data
that I care about. Hence my suggestion of putting the less
huge-tree-performance-sensitive git store on another partition and
keeping only the working copy on reiserX.

The other thing about the Reiser-family FSs is that they eat
enormous amounts of CPU, to the point that they can *fully peg*
a CPU during I/O.

> Andrew Morton vaguely recalled something like 30% performance
> loss when turning on write barriers for ext3.

Yes, ext3 is slow. It is much faster than the Btree-based FSs in
a select few circumstances, but in general, it is slow or very
slow.

> Chris Mason (who worked on reiser3 at the time and is I think
> chief architect of btrfs now) cooked up a test script which
> could cause filesystem corruption in ext3 with about 50%
> probability in case of power loss.

Right, but how much data gets lost in such a case? In my
experience and that of every sysadmin I’ve asked, unrecoverable
corruption is essentially always quite localised with extX.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

On Jun 23, 2008, at 7:15 AM, Aristotle Pagaltzis wrote:

> I guess it’s down to JFS…

I'm not sure what the use case we're talking about is, but for many  
use cases reiser3 is a good choice.  Here are a few links to papers  
which analyzed various filesystems (including reiser3, JFS, XFS,  
ext3, and NTFS) and which altogether suggest that reiser3 is better  
engineered for data correctness (at the expense of availability) than  
most:

http://allmydata.org/trac/tahoe/wiki/Bibliography#LocalFilesystems

Also note that I recently learned that the benchmarks that we've  
looked at over the years were mostly done with write barriers turned  
on by reiser3 and write barriers turned off by ext3:

http://lwn.net/Articles/283161

Andrew Morton vaguely recalled something like 30% performance loss  
when turning on write barriers for ext3.

The thread is interesting to follow.  Chris Mason (who worked on  
reiser3 at the time and is I think chief architect of btrfs now)  
cooked up a test script which could cause filesystem corruption in  
ext3 with about 50% probability in case of power loss.

I agree that ZFS and btrfs are interesting alternatives, and I also  
remain interested in reiser4.  Note that you can use ZFS today on a  
Free Software operating system -- just install OpenSolaris.  (I use  
Nexenta -- Solaris with apt-get.)

Regards,

Zooko

On Mon, Jun 23, 2008 at 04:15:48PM +0200, Aristotle Pagaltzis wrote:

> Wow. What use is journaling when a system crash leaves your
> filesystem corrupted anyway?

Yes, it is rather unfortunate. But if your system doesn't panic/lock up
and has an UPS, xfs works well.
 
> I guess it’s down to JFS… except that JFS reportedly performs
> poorly with voluminous trees (although still better than ext3,
> as far as I understood).
> 
> So maybe in the end the conclusion is to use reiser4 for the
> working directory and keep a regularly-repacked git store on
> a more robust filesystem. (You can set the `GIT_DIR` environment
> variable to tell git commands where to look for it.)
> 
> That doesn’t address the issue that though fast, both reiserfs
> and reiser4 are very CPU-hungry, though.
> 
> Pity that btrfs is basically just an alpha even now…

I'm hoping for a GPL-compatible zfs license in a year or two.

-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE

* Eugen Leitl <eugen at leitl.org> [2008-06-23 11:50]:
> When using XFS make sure your system is backed up by an UPS,
> and doesn't crash. I wouldn't use it as a root filesystem
> otherwise.

Wow. What use is journaling when a system crash leaves your
filesystem corrupted anyway?

I guess it’s down to JFS… except that JFS reportedly performs
poorly with voluminous trees (although still better than ext3,
as far as I understood).

So maybe in the end the conclusion is to use reiser4 for the
working directory and keep a regularly-repacked git store on
a more robust filesystem. (You can set the `GIT_DIR` environment
variable to tell git commands where to look for it.)

That doesn’t address the issue that though fast, both reiserfs
and reiser4 are very CPU-hungry, though.

Pity that btrfs is basically just an alpha even now…

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

On Sat, Jun 21, 2008 at 10:58:52AM +0200, Aristotle Pagaltzis wrote:

> Both suggest that you may want to consider XFS or JFS instead.

When using XFS make sure your system is backed up by an UPS, and
doesn't crash. I wouldn't use it as a root filesystem otherwise.

-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE

(Available in HTML at <http://canonical.org/~kragen/wood-pda.html>.)

Polished-Stone Handheld Computers
---------------------------------

So I've been thinking about making a handheld computer with the look
and feel (shininess, irregularity, weight, seamlessness) of a polished
semiprecious stone.

One way to do this would be to embed the electronics in polyester
resin poured into a mold, with an embedded induction coil for
charging, some embedded lead shot for weight, and a dark, but not
quite opaque, surface layer to hide the interior except for when it
was glowing.  Input would probably be piezoelectric, localizing
surface taps or using rhythm.  (See earlier kragen-tol post [magic
boxes and secret knocks][magic].)  Output could be through embedded
LEDs shining through the surface layer or through audio, especially if
you held it against a window.

(How much lead shot would you need?  Lead has a density of 11.3g/cc,
against quartz's 2.6g/cc and [the polyester resin's 1.11g/cc][EP4117],
so only 14.6% of the volume would need to be lead to equal quartz's
density.)
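
The 14.6% comes from a linear mixing rule: if a fraction f of the volume is lead and the rest is resin, the overall density is f × 11.3 + (1 − f) × 1.11.  A quick sketch of that calculation; it ignores the volume and mass of the electronics themselves, and the function name is just for illustration:

```python
def lead_volume_fraction(target, resin, lead):
    """Fraction f of the volume that must be lead so that
    f * lead + (1 - f) * resin == target (all densities in g/cc)."""
    return (target - resin) / (lead - resin)

f = lead_volume_fraction(target=2.6, resin=1.11, lead=11.3)
print(f"{f:.1%}")   # 14.6%
```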

It would be shockproof, waterproof, crushproof, not particularly prone
to damage from ESD, and it would feel really good in your hand.  Some
hard silicone around the outside might improve its thermal
conductivity.  (There are hard silicone resins with high thermal
conductivity, right?)

Beatrice suggested that you could use an actual polished semiprecious
stone instead; cut out a circle from one side, drill out a cavity
underneath, put the electronics inside, pot them with epoxy, replace
the circle, wipe off the excess epoxy, and then polish the result.

Wood-Block Handheld Computers
-----------------------------

Another "everyday object" kind of electronic device case: a block of
wood.  Some time ago I saw a web page about a wooden clock.  It seems
to be widely available now; for example,
<http://svp.co.uk/products-solo.php?pid=4989&ref=froogle&ci_src=18615224&ci_sku=8028>
advertises it for £93.99.  It explains:

> A totally minimal block of wood with digital numbers floating
> across the surface. These clever clocks have a very thin layer of
> real maple wood veneer that permits the LEDs to shine through.
> 
> Each one is slightly different due to the natural variation in
> wood grain.
> 
> Dimensions: 208 x 90 x 90mm
> Weight: 1.2kg

Another page says:

> TO:CA 'wood' LED clock designed by kouji iwasaki in 2002. this
> 'wooden' LED clock won top prize at the asahikawa international
> design fair in 2002.

A third page says they're actually made of MDF under the maple veneer,
and has a photograph of the back that seems to confirm this, and a
fourth page says the manufacturer is "Takumi of Japan".

I think a handheld computer that looks like a block of wood would be
pretty nice too.  Something the size of a business card (3.5" x 2", or
89 x 51 mm) but fairly thick (say, 15mm), with veneer on at least one
side.  The resolution of the display would be limited by the light
blurring on the way through the translucent veneer; each spot of light
would have a radius on the order of the thickness of the veneer.
[Veneers are typically 0.8mm][veneers] but are available as thin as
0.3mm.

If the pixels were spaced 1.6mm apart, you could get almost 1800 of
them in a rectangular array into the business-card size.  You could do
a little better with a hexagonal array: if the distance from the
center of a regular hexagon to the center of one of its sides is r,
then the distance to one of its corners is about 1.155r, which is the
same as the length of each side; and its area is 1½ * 1.155r * 2r,
or about 3.46r², which is 13% smaller than a square circumscribed
around the same size of circle (4r²).  In the case of r=0.8mm, you'd
have 2.2mm² per pixel instead of 2.56, so you'd get about 2000 pixels.
But then you'd have to deal with the hexagonal array in your software.
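
The pixel counts can be checked with a short sketch. It assumes the whole 89 x 51 mm face is tiled and ignores partial cells at the edges, so the results are slight overestimates; the variable names are illustrative.

```python
import math

CARD_W, CARD_H = 89.0, 51.0   # mm, business-card face from the text
R = 0.8                       # mm, blur radius, taken as the veneer thickness

card_area = CARD_W * CARD_H

# Square grid: each pixel owns a 2r x 2r cell.
square_cell = (2 * R) ** 2

# Hexagonal grid: each pixel owns a regular hexagon with apothem r,
# whose area is 2*sqrt(3)*r^2 (about 3.46 r^2).
hex_cell = 2 * math.sqrt(3) * R ** 2

square_pixels = card_area / square_cell
hex_pixels = card_area / hex_cell

print("square: %.2f mm^2/pixel, ~%d pixels" % (square_cell, square_pixels))
print("hex:    %.2f mm^2/pixel, ~%d pixels" % (hex_cell, hex_pixels))
print("hex cell is %.0f%% smaller" % (100 * (1 - hex_cell / square_cell)))
```

Changing `R` to 0.3 shows what the thinnest veneers would buy under the same assumptions.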

1800 pixels is enough for about 45 letters in a traditional 5x8
single-bit-deep font, which is pretty cramped; my cheap two-year-old
US$30 cellphone has something like 65 letters' worth of space on its
display.  But it's enough to be useful.  It's a lot more than any of
the under-US-$10 devices I picked up for the "[cheap electronics
dissection project][electronics]" in 2006, and they are useful for
some things.

I don't know how easy or hard it is to populate a PC board with
1800-2000 LEDs.  I know I wouldn't want to do it by hand.

You could hollow out the middle of a block of wood with just a drill
and jigsaw; a keyhole saw or wire saw might work in place of the
jigsaw.  Cutting all the way through it would be a lot easier than
just chiseling out a hollow in one side of the block; then you'd need
to put veneer on both sides instead of just one.  To add strength and
keep it from sounding hollow, you'd probably want to pot the whole
interior with epoxy or something.

You could have a couple of finishing nails visible on one end if you
wanted to charge it through actual electrical contacts rather than
with induction.

Other Everyday Items
--------------------

You could also embed handheld computers in the following: oyster
shells; bricks; pens (I suggested this previously on kragen-tol);
ceramic tiles; beanbags, pillows, and stuffed animals (like the Chumby
and the Furby).

References
----------

[EP4117]: http://www.eagerplastics.com/4117.htm "Eager Plastics EP4117"

Eager Plastics, aka Eager Polymers, has an "[EP4117][] General Purpose
Polyester Laminating Resin" with a density of 1.11 g/cc.

[magic]: http://lists.canonical.org/pipermail/kragen-tol/2002-April/000700.html

In April 2002, I posted "[magic boxes and secret knocks][magic]" to
kragen-tol. 

[veneers]: http://www.diyinfo.org/wiki/Using_Veneers "an article on DIYinfo.org"

The article [Using Veneers][veneers] describes the different kinds of
wood veneers available today.

[electronics]: http://courageous.murch-sitaker.org/~kragen/electronics/

In 2006 I wrote a web page about my "[cheap electronics dissection
project][electronics]", where I bought a bunch of cheap electronics
and looked inside them.

Sat, 21 Jun 2008

Yes, you should try that experiment on reiserfs a.k.a. reiser3.   
Also, if you patch your kernel to support reiser4 and try that then I  
would be happy to learn your results.

Regards,

Zooko

* Kragen Javier Sitaker <kragen at pobox.com> [2008-06-21 09:40]:
> I probably ought to try running this on murdererfs and see if
> it performs better; after all, this is the kind of thing it's
> made for, right?

This 2006 comparison of filesystems suggests that it’s not quite
up to fame: http://www.debian-administration.org/articles/388

http://fsbench.netnation.com/ dials back on that a bit; but the
fact remains that convictfs knows no shame when helping itself to
your clock cycles and memory circuits.

Both suggest that you may want to consider XFS or JFS instead.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

I wanted to experiment with having one mail message per file, but I'm
sad to report that, on my kernel, ext3fs becomes unusably slow with
random access to hundreds of thousands of files, even with
`dir_index`, even if the branching factor is only 100 files per
directory.  (I'm using Linux 2.6.18-6-686 from Debian Etch with
dm-crypt filesystem encryption on a year-1999 700MHz PIII laptop.
Maybe it wouldn't be a problem with a real computer, or a more recent
kernel.)

An earlier version of this code was responsible for the directory with
800 000 files that I mentioned in a previous kragen-hacks post which
included a shell script to remove that directory.  (I had given up on
`rm -rf` after 8 hours.)  This version doesn't stress the filesystem
quite as badly.

What I had in mind was to try checking my mailbox into Git, to get
better compression, faster syncing, and automatic detection and
correction of data corruption.  The 1.4 version of Git I have doesn't
seem to perform reasonably on the task, but maybe that's because the
underlying filesystem is sucking rocks.

I probably ought to try running this on murdererfs and see if it
performs better; after all, this is the kind of thing it's made for,
right?

Like everything else posted to kragen-hacks without any notice to the
contrary, this program is in the public domain; I abandon any
copyright in it.

#!/usr/bin/python
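import sys, os, errno

def makedirs(name):
    # Create the directory and any missing parents; it is not an error
    # if the directory already exists.
    try: os.makedirs(name)
    except OSError as e:
        if e.errno == errno.EEXIST: pass
        else: raise

class Output:
    def __init__(self, dirname):
        self.dirname = dirname
        self.counter = 0
        self.fo = None
        self.go_to_new_file()
    def go_to_new_file(self):
        # Close the previous message explicitly instead of leaving it
        # to garbage collection.
        if self.fo is not None: self.fo.close()
        self.counter += 1
        # Two directory levels keyed on the low decimal digits of the
        # counter: 100 top-level directories of 100 subdirectories each.
        dirname = '%s/%02d/%04d' % (self.dirname,
                                    self.counter % 100,
                                    self.counter % 10000)
        filename = '%s/message-%s' % (dirname, self.counter)
        makedirs(dirname)
        self.fo = open(filename, 'w')
    def write(self, data): self.fo.write(data)
    def close(self): self.fo.close()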

def split(mbox, output):
    for line in mbox:
        if line.startswith('From '):
            output.go_to_new_file()
        output.write(line)
    output.close()

if __name__ == '__main__':
    split(sys.stdin, Output(sys.argv[1]))
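
To see what the directory scheme in `go_to_new_file` does to the fan-out, here is a small sketch that reproduces only the path arithmetic, without touching the filesystem. The helper names are hypothetical, and the 800 000 figure is borrowed from the directory mentioned above.

```python
from collections import Counter

def message_path(root, counter):
    """Same layout as Output.go_to_new_file: two directory levels
    keyed on the low decimal digits of the message counter."""
    return '%s/%02d/%04d/message-%s' % (
        root, counter % 100, counter % 10000, counter)

def leaf_occupancy(n_messages):
    """Count how many messages land in each leaf directory."""
    leaves = Counter()
    for counter in range(1, n_messages + 1):
        leaves[(counter % 100, counter % 10000)] += 1
    return leaves

print(message_path('out', 12345))
leaves = leaf_occupancy(800000)
print(len(leaves), max(leaves.values()))
```

Because `counter % 10000` already determines `counter % 100`, each top-level directory holds 100 subdirectories, and the messages are spread evenly over the 10 000 leaves.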

Thu, 19 Jun 2008

So I just upgraded to Emacs 22 in April, despite Debian Etch not
supporting it.  It solves several of my daily annoyances with Emacs
21:
- It recognizes "Password: " as a password prompt, so ssh and sudo get
  the benefit of me not having to manually type M-x send-invisible.
- I can paste Unicode text into it from a web browser, including
  asymmetrical quotes, real apostrophes, and em dashes, and have it
  save them to a UTF-8 file without fuss.  (Although it still displays
  the quotes in an obnoxious double-width fashion until the file has
  been saved and reloaded.)
- TRAMP works out of the box.
- The documentation is included, unlike in Debian.  (There's a
  licensing dispute over whether the GNU Free Documentation License is
  free enough to satisfy the Debian Free Software Guidelines.)
- comment-region now asks what comment syntax to use if it doesn't
  know.
- When I run e.g. "darcs" by itself in shell-mode, occasionally Emacs
  used to take quite a while to display its output usage message,
  because it was reading it one character at a time.  This has been
  fixed.

I also anticipate joy using MuMaMo, but I haven't actually tried that
yet.

There are some changelog/news entries that sounded pretty good:
    ...if you set `set-mark-command-repeat-pop' to t.  I.e. C-u C-SPC           
    C-SPC C-SPC ... cycles through the mark ring.  Use C-u C-u C-SPC
    to set the mark immediately after a jump.  [Haven't tried this yet.]

    ...M-% typed in isearch mode invokes `query-replace' or
    `query-replace-regexp' (depending on search mode) with the current
    search string used as the string to replace.  [Haven't tried this
    yet.]

    You can now customize the use of window fringes.  To control this
    for all frames, use M-x fringe-mode or the Show/Hide submenu
    of... [so now I can have two 80-column windows on my screen at
    once, which is awesome]

    A new minor mode `next-error-follow-minor-mode' ... In this mode,
    cursor motion in the buffer causes automatic display in another
    window of the corresponding matches, compilation errors,
    etc. [Haven't tried this.]

    The new command `multi-occur' is just like `occur', except it can
    search multiple buffers. [Useful. Also I didn't know about
    `occur`.]

    The grep commands provide highlighting support. Hits are fontified
    in green, and hits in binary files in orange.  Grep buffers can be
    saved and automatically revisited.  [This is in fact extremely
    awesome.]

    In addition, when ending or calling a macro with C-x e, the macro
    can be repeated immediately by typing just the `e'. [This sounds
    nice, but the F3 and F4 macro keybindings are better.]

    The new package longlines.el provides ... "soft word wrap" [like
    actual word processors have since the 1970s.  Turns out to be
    fantastic.]

    SES mode (ses-mode) is a new major mode for creating and editing
    spreadsheet files.  [Haven't tried this yet.]

    The new package table.el implements editable, WYSIWYG, embedded
    `text tables' in Emacs buffers  [Haven't tried this yet.]

    The new package flymake.el does on-the-fly syntax checking of
    program source files.  [Haven't tried this yet.]

    savehist saves minibuffer histories between sessions.  [Haven't
    tried this yet.]

    isearch in Info uses Info-search and searches through multiple
    nodes.  [This is fantastic.]

    Atomic change groups: To perform some changes in the current
    buffer "atomically" so that they either all succeed or are all
    undone, use `atomic-change-group' around the code that makes
    changes.  [Sounds like a fantastic idea, but I haven't tried it
    either.]

So far I've only noticed two new annoyances: one is that it uses its
own python-mode that I don't like as well as the one that comes with
Python, and the other is that C-x C-f RET no longer reverts the file
to the version in the filesystem (assuming the buffer wasn't edited);
now you actually have to type the filename.

The stuff in the NEWS file (C-h N) looks pretty innocuous.  Nothing is
terribly exciting, though.

Mon, 16 Jun 2008

Here are the top few things I learned about git in the first few hours
I used it.  This is the document I wished I had had, on top of the
various introductions floating around.  Maybe it will be useful to
somebody else.

0. Git handles 400MB of HTML crawl data less gracefully than it
   handles 700K of Python.  But it handles that data more gracefully
   than `cp` and `rsync` do.

1. Don't `git push` to a repository that actually has a work area.
   Always use `git pull` instead.  `git push` doesn't update the
   associated working area, or the index either, so if you try to `git
   commit` in that repository, you will commit a patch that undoes all
   the stuff you just did.  See
   <http://utsl.gen.nz/talks/git-svn/intro.html> section "Push changes
   and the working copy".  You can solve this with `git reset --mixed
   HEAD`, or eventually `git reset --hard HEAD` to throw away any
   changes in the working area.

       23:05 < johnw> $ rsync -av .git/ server:/tmp/foo.git/ ; cd /tmp ; git clone ssh://server/tmp/foo.git
       23:06 < johnw> that's all you need to setup a remote repository, and to start using it right away

2. `git repack -a -d -f` can achieve some truly astonishing
   compression ratios.  This is how you make git checkouts faster than
   cp -a or rsync.  In my case, three times faster than rsync over a
   slow network, due to a 7:1 compression ratio.

3. You have to `git add` changed files before you can `git commit`
   them, or use `git commit -a`, because `git commit` commits things
   from the index, not your work area.  In older versions of Git, you
   used `git update-index` instead of `git add` on changed files.

4. `git commit` takes an option `--amend` which lets you amend
   previous commits.

5. `git clone -l` makes a hardlinked clone.

6. git has early-stage support for something called "submodules" in
   recent versions, similar to svn:externals.  And there's an
   in-development `git hunk-commit` command that might end up in git
   someday that should add most of Darcs's UI niceness to git.
   <http://raphael.slinckx.net/files/git-darcs-record>

I took the first 125 563 056 bytes of my mailbox and compressed them
into 59M with git.  However, git (1.4) doesn't seem to work very well
with multi-gigabyte quantities.

If you're using the Git 1.4 from Debian Stable, you'll want to know to
use `init-db` instead of `init`, `repo-config` instead of `config`,
and often `update-index` instead of `add`.

Sat, 14 Jun 2008

All the timing notes here are from my 700MHz laptop with filesystem
encryption.

I did this in Perl because Python doesn't have a `readdir` that lets
you iterate over the files in the directory one at a time; it only has
a `listdir` that constructs a list of all of them in memory and
returns it.  Maybe that would have been fine.

Like everything else posted to kragen-hacks without any notice to the
contrary, this program is in the public domain; I abandon any
copyright in it.

# wrote this to clean up a directory with 500 000 files in it.  After
# 8 hours of rm -rf, rm had only cut it down from 800 000 to 500 000
# files.

# without the unlink, this took only 99 seconds to run over all 
# 500 000 files.

# with unlink, it got through 18479 files in 3m2.7s, so that's
# actually 100 files per second, so it should be done after 5000
# seconds.

# A second run got through 60893 files in 5m23s, or 323s, which is 188
# files per second.  This isn't reassuring that I'm measuring
# performance accurately but at least I know there's no substantial
# N^2 term.

# third run: 234502 files in 20m45s.  Again 188 files per second.  But
# then the next time around, it took 80 seconds without successfully
# readdirring anything.

time perl -e 'opendir MB, "mboxtmp" or die; 
     while (my $x = readdir MB) { print "$x\n" }
' | perl -ne '$| = 1; 
              chomp; 
              unlink "mboxtmp/$_" or warn "$_: $!";
              print "$.\r"'

time perl -e 'opendir MB, "mboxtmp.2" or die; 
     while (my $x = readdir MB) { print "$x\n" }
' | perl -ne '$| = 1; 
              chomp; 
              unlink "mboxtmp.2/$_" or warn "$_: $!";
              print "$.\r"'
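
For comparison, Python 3.5 and later do have a lazy directory iterator, `os.scandir` (usable as a context manager from 3.6). A minimal sketch of the same cleanup under that assumption follows. The function name and the progress interval are invented here, and it has not been timed against the Perl pipeline.

```python
import os
import sys

def unlink_all(dirname, progress_every=1000):
    """Delete the regular files in dirname, streaming over the entries
    instead of building the whole listing in memory first."""
    removed = 0
    with os.scandir(dirname) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue  # leave subdirectories and symlinks alone
            try:
                os.unlink(entry.path)
            except OSError as e:
                sys.stderr.write('%s: %s\n' % (entry.name, e))
                continue
            removed += 1
            if removed % progress_every == 0:
                sys.stderr.write('%d\r' % removed)  # progress counter
    return removed
```

Calling `unlink_all('mboxtmp')` would correspond to the first Perl command above. Unlike the Perl version, this sketch skips anything that is not a regular file.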

Thu, 12 Jun 2008

It was kind of a chilly late autumn here in Buenos Aires as I wrote
this, 2008-04-19; but, because of this [rather remarkable
environmental
phenomenon](http://modis.gsfc.nasa.gov/gallery/individual.php?db_date=2008-04-18
"MODIS image of the fires and smoke"), I had the air conditioner turned
on full blast.

This was apparently the best we could do at avoiding the smoke that
blanketed the city, and I thought I should document the crazy scheme
I was using, since it seemed to work.

The Crazy Scheme
----------------

I heated the house a lot using the water heater; like many Argentine
houses, we have one of those wonderful tankless water heaters, which
can provide an unlimited supply of scalding hot water unless they
catch on fire or melt or something.  (Don't laugh; our friends Kevin
and Alicia's caught on fire several times.)  So I ran the shower very
slowly to fill up the jacuzzi tub (we have a jacuzzi tub, you see),
then started the bubble jets running to transfer more of the heat into
the air, while using a fan to vent the hot, humid air from the
bathroom to the rest of the house, to give the air conditioner more
heat to chew on.

(Unlike, say, the stove, the hot water heater is properly vented to
outside, with a little horizontal metal chimney.  And it has a hell of
a gas burner inside.)

I ran the air conditioner because it does remove some of the smoke
from the air, but it's driven by a thermostat, so it would stop
running if the air cooled off.  So that's why I used the hot water
heater to heat up the house.

It turned out that the hot shower and the jacuzzi bubble jets also
removed smoke from the air.  Afterwards there was quite a bit of smoke
deposited around the edge of the bathtub from the bubbles.  Photos
will be forthcoming eventually, if I haven't lost them all.

The humidity added to the air also ameliorated the eye and bronchial
irritation from the smoke.

Fine-Tuning
-----------

I added a little shampoo to the water in order to make it pick up
smoke particles and transfer heat better.  (Smaller bubbles, less
spherical bubbles, and less tendency for the water to repel smoky oil
particles.)  Unfortunately, jacuzzis being jacuzzis, this resulted in
a huge mass of bubbles growing from the water and threatening to swamp
the bathroom.

So I added a little conditioner, and the bubbles died back down.

In general, the surfactant action I was looking for and the foaming
action I got are not inseparable.  I wonder if I could get better
results with dishwasher detergent, for example (although ideally
without bleach), industrial degreasers, or some simple combination
like Calblend plus simethicone.

The placement of the fan in the bathroom door proved important.  Ideal
would be a fan at the other end of a long duct, either blowing cool,
dry air into the bathroom, or sucking hot, wet air out of the top of
the bathroom.  What seemed to work best with the fan I have was
blowing hot, wet air out of the bathroom and into the bedroom, where
the air conditioner is.  This produced a certain amount of fog.

Real Air Filtration
-------------------

I was hoping to try to get a HEPA filtration system for the house.  I
knew it wouldn't be easy, because there aren't many stores here that
sell them, and due to the smoke, even the stores that carry them
seemed to be out.  So we made do with N95 respirators, which I bought
at a pharmacy that evening after hunting for hours for something
better.

We spent a lot of yesterday and this morning with wet bandannas around
our faces, inside the house.  I have a headband that I used to secure
the bottom of the bandanna against my chin, and I rolled up strips of
paper towel to put on each side of my nose to block the spaces there.
That seemed to provide some noticeable protection.

I picked up some deionized water at Carrefour that evening; I figured
it might work better for air filtration than tap water.  I'm not sure
whether it did or not.

I wish I had some way of measuring the smoke, other than by counting
how often I cough, because I'd like to know (for example) whether the
N95 respirators work better than the bandannas, or even whether
they're effective at all against smoke.

Mon, 09 Jun 2008

On Sun, 8 Jun 2008 20:19:39 +0200, Aristotle Pagaltzis wrote:
> * Kragen Javier Sitaker <kragen at canonical.org> [2008-06-08 18:00]:
> > (I also have this theory that
> > I'll be a happier person if I don't spend too much time on
> > inventing private markup languages, and that makes me perhaps
> > unreasonably reluctant to add extra postprocessing steps, even
> > when the alternative is ill-formed HTML.)
> 
> I agree with that theory, which is why I use Markdown despite the
> minor (and a few less minor) things I dislike about it. Attaching
> some light postprocessing to Markdown to add support for small
> missing bits seems like a much better approach to achieve both
> satisfaction and an absence of unhappiness.

Agreed, especially if there's some graceful degradation.

> I try to write code only when I can’t help it…

I try to accumulate code I have to maintain only when I can't help it.
Writing code, however, is fun, and I probably reinvent the wheel quite
a bit.

On Sat, 7 Jun 2008 22:34:30 -0400, Waylan Limberg wrote:
> Well, first of all, he's using an old version of Python-Markdown.

Yes, 1.4-2, the version that's in Debian Etch.  System administration
is not really a favorite hobby of mine; I tend to stick with the
versions of things in my distribution unless I have a good reason to
upgrade.

> The first line of his `render` function gives it away (due to the
> change in 1.7 to all-unicode -- you generally don't pass unicode
> text to str())
> 
>         body = str(markdown.Markdown(text))
> 
> Just use the wrapper (all lowercase in any version):
> 
>         body = markdown.markdown(text)

Thanks!

Kragen

(This is available in HTML at <http://pobox.com/~kragen/science-espeak.html>.)

So I've been playing around with speech synthesis software tonight.
[eSpeak](http://espeak.sourceforge.net/) looks a lot nicer than
[Festival](http://www.cstr.ed.ac.uk/projects/festival/), just in that
it's much easier to adjust its speed, correct its pronunciation, and
play with variations: whisper, different accents, pitch, word spacing,
creaky voice.  I got to thinking, what would a logical policy for
updating its lexicon look like?  I thought the results I came up with
were interesting.  Maybe some other people will be interested too.

The problem
-----------

[eSpeak](http://espeak.sourceforge.net) gets "neuroscience" and
"pseudoscience" wrong, pronouncing them with a `[[s,i@ns]]` rather
than a `[[s'aI@ns]]`.  It also gets "omniscience" and "prescience"
wrong, or at least pronounces them rather differently than I would:

    $ ~/pkgs/espeak-1.37-source/src/speak -v en/en-r+f2 -s 250 -x "The 
        science of neuroscience is not a scientific or quasiscientific
        pseudoscience.  Conscientiously pursue omniscience and prescience."
     D@2 s'aI@ns Vv n'3:r-@s,i@ns I2z n,0t#@ saI@nt'IfIk _:_:O@ kw,eIzaIsi@nt'IfIk sj'u:d@s,i@ns
     k,0nsI2;'EnS@sli p3sj'u: '0mnIs,i@ns _:_:and pr'i:si@ns

I would pronounce the "science" in "omniscience" and "prescience" as
`[[S@ns]]` and put the accent on another syllable.

There's a special rule for "scien" beginning a word, and for
"conscience":

    en_list:conscience       k0nS@ns
    en_rules:       _sc) ie (n        aI@
    en_rules:?8     _sc) ie (n        aIa2

However, Jonathan Duddington has said he wants to keep the eSpeak
distribution small, so he "wouldn't want to include too many unusual
or specialist words".  (See
<http://sourceforge.net/forum/forum.php?thread_id=1700280&forum_id=538920>
where he talks about why he doesn't want to import the Festival
lexicon.)  Already, `espeak-data/en_dict` is 80KB, which is half the
size of the `speak` binary.

Replacement strategies
----------------------

There are several possible strategies that a maintainer could adopt in
order to improve the coverage of their special-case word files without
letting them get large.  Suppose that there is a scalar metric of
"goodness" that can be applied independently to each special case.
Here are three plausible strategies, ordered from least to most
stringent.

- C-: They could never remove items from the file, adding new items as
  long as they were better than the worst item in the file.  This will
  probably cause the average quality of the entries in the file to
  gradually decline, because many of the most important entries were
  probably added early on.  It will eventually result in a very large
  file with very low average quality per entry, but very comprehensive
  coverage.
- C+: They could keep the number of items in the file fixed, adding
  new items as long as they were better than the worst item in the
  file.  This will cause the program to gradually work better, but
  each new version will introduce regressions --- words that the
  previous version pronounced correctly, but the new one does not.
- A: They could never remove items, but add new items as long as they
  improved the median item quality of the file --- that is, as long as
  the new item improved the program's performance more than most of
  the items in the file.  This will gradually slow down and eventually
  stop the addition of new items, because that median quality will
  gradually increase.

I am going to approximate "quality" with "frequency", on the theory
that mispronouncing a rare word is always better than mispronouncing a
common one.

Note the analogy to Google's famous hiring policy: only hiring
candidates who would raise the average ability of the people already
there.
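
The three strategies can be sketched as functions over a list of quality scores. This is only an illustration, assuming quality is a single number where higher is better; the function names and the sample scores are hypothetical.

```python
import statistics

def policy_c_minus(entries, candidate):
    """Never remove; add if better than the worst current entry."""
    if candidate > min(entries):
        return entries + [candidate]
    return entries

def policy_c_plus(entries, candidate):
    """Fixed size; a better candidate evicts the worst entry."""
    worst = min(entries)
    if candidate > worst:
        kept = list(entries)
        kept.remove(worst)
        return kept + [candidate]
    return entries

def policy_a(entries, candidate):
    """Never remove; add only if better than the median entry."""
    if candidate > statistics.median(entries):
        return entries + [candidate]
    return entries

# Hypothetical quality scores; higher is better.
entries = [463240, 7999, 1441, 455, 62, 8]
for policy in (policy_c_minus, policy_c_plus, policy_a):
    print(policy.__name__, sorted(policy(entries, 53), reverse=True))
```

With these scores a candidate of 53 is accepted by "C-" (the list grows), accepted by "C+" at the cost of evicting the entry scored 8, and rejected by "A" because it falls below the median.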

Evaluating word frequencies
---------------------------

Are these "science" words significant enough to include?  `en_list`
only contains 2869 lines, maybe 2400 of which are words.  So maybe
only the top 2400 or so exceptions to the normal rules of
pronunciation are currently considered for inclusion.

Some time ago, I tabulated the frequencies of words in the British
National Corpus and put the results online at
<http://pobox.com/~kragen/sw/wordlist>.  It has 109557 lines, ordered
from the most common words ("the", "of", and "and", each occurring
millions of times) to the least common (with a cutoff of 5
occurrences, because most of the words with fewer were actually
misspellings).

I selected 20 lines at random from `en_list` with the following
results:

    kragen@thrifty:~/pkgs/espeak-1.37-source/dictsource$ ~/bin/unsort < en_list | head -20
    this             %DIs          $nounf $strend $verbsf
    barbeque         bA@b@kju:
    con              k0n
    ?5 thu  TIR        // Thursday
    _:      koUl@n
    Ukraine         ju:kr'eIn
    peculiar         pI2kju:lI3
    unread           Vnr'Ed        $only
    inference        Inf@r@ns
    José            hoUs'eI
    unsure           VnS'U@
    survey                         $verb
    ë       $accent
    epistle          I2pIs@L
    Munich          mju:nIk
    scenic           si:nIk
    synthesise       sInT@saIz
    corps            kO@           $only
    rajah            rA:dZA:
    transports       transpo@t|s    $nounf

Where do these special cases appear in the British National Corpus
tabulation?  Here are some results, edited for readability:

    kragen@thrifty:~/pkgs/espeak-1.37-source/dictsource$ grep -niE ' (this|barbeque
       |con|thu|ukraine|peculiar|unread|inference|José|unsure|survey|epistle|munich
       |scenic|synthesise|corps|rajah|transports)$' /home/kragen/devel/wordlist
    22:463240 this
    1178:7999 survey
    5102:1441 peculiar
    5831:1200 corps

    7165:888 ukraine
    8977:634 munich
    9045:627 unsure
    10552:494 inference

    11134:455 con

    15127:275 scenic
    29899:82 epistle
    31386:74 transports
    34270:62 synthesise

    37255:52 unread
    73679:11 thu
    74154:11 rajah
    87737:8 barbeque

The 50th-percentile among the sample of 20 (of which two weren't
words, and a third wasn't found) seems to be line 11 134 with the word
"con".  That is, the exceptions in `en_list` are mostly drawn from the
most frequently used eleven thousand words in the language.  (Maybe
words like "barbeque", "rajah", and "unread" should be dropped.)

So under the policies "C+" and "C-", any word that is more common than
"barbeque", at position 87737 in the British National Corpus
tabulation, (or maybe some word even a bit rarer than that) should be
added to the file.  (Under policy "C+", some word would be removed to
compensate, raising the threshold.)  Under the policy "A", the
threshold would be "con", at position 11 134.
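
The two thresholds can be recomputed from the sample. The sketch below copies the ranks (line numbers in the tabulation) from the grep output above; the `qualifies` helper and the rank 20000 used to exercise it are hypothetical.

```python
import statistics

# Rank (line number in the BNC tabulation) of each sampled en_list
# entry that was found, copied from the grep output above.
sample_ranks = {
    'this': 22, 'survey': 1178, 'peculiar': 5102, 'corps': 5831,
    'ukraine': 7165, 'munich': 8977, 'unsure': 9045,
    'inference': 10552, 'con': 11134, 'scenic': 15127,
    'epistle': 29899, 'transports': 31386, 'synthesise': 34270,
    'unread': 37255, 'thu': 73679, 'rajah': 74154, 'barbeque': 87737,
}

median_rank = statistics.median(sample_ranks.values())
worst_rank = max(sample_ranks.values())

def qualifies(rank, policy):
    """Would a word at this rank be added?  Lower rank = more common."""
    threshold = median_rank if policy == 'A' else worst_rank
    return rank < threshold

print(median_rank, worst_rank)
print(qualifies(20000, 'A'), qualifies(20000, 'C-'))
```

A sample of 17 words is small, so both thresholds are rough estimates of what the full `en_list` would give.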

Unfortunately, José is missing.  I think I excluded accented
characters when I tabulated the frequencies initially.

Anyway, that gives us a way to compare the "science" words:

    kragen@thrifty:~/pkgs/espeak-1.37-source/dictsource$ grep -n scien[tc]
        /home/kragen/devel/wordlist 
    870:10597 science
    1614:5922 scientific
    2584:3547 scientists
    3865:2088 sciences
    3977:2005 scientist
    5342:1355 conscience

    13365:338 conscientious
    16976:227 scientifically
    25757:109 consciences
    26015:107 conscientiously
    27861:93 unscientific
    37040:53 omniscient
    44349:36 prescient
    49031:29 neuroscience
    49706:28 prescience
    50457:27 scientificity
    50587:27 omniscience
    53155:24 scientism
    62346:17 geoscience
    66943:14 scientia
    67285:14 neuroscientists
    68176:14 conscientiousness
    82060:9 geoscientists
    84433:8 scientology
    84434:8 scienter

    86513:8 geosciences
    90235:7 neurosciences
    93073:7 biosciences
    93074:7 bioscience
    95039:6 scientifique
    95591:6 pseudoscience
    103190:5 presciently
    103191:5 prescientific

Of these, only those at least as common as "conscience" seem to
deserve a place in `en_list`.  How does eSpeak do now?

    $ ~/pkgs/espeak-1.37-source/src/speak -v en/en-r+f2 -s 250 -x "Science is 
        scientific and done by scientists, who work in the sciences.  A 
        scientist with a conscience may be conscientious.  Those with 
        scientifically-minded consciences will conscientiously avoid 
        unscientific claims of omniscient beings or prescient prophets."
     s'aI@ns I2z saI@nt'IfIk _:_:and d'Vn baI s'aI@nt#Ists
     _:_:h,u: w'3:k I2nD@2 s'aI@nsI2z
     a2 s'aI@nt#Ist wI2D a2 k'0nS@ns m'eI bi: k,0nsI2;'EnS@s
     DoUz wI2D saI@nt'IfIkli m'aIndI2d k'0nS@nsI2z wIl k,0nsI2;'EnS@sli; a2v'OId
     VnsaI@nt'IfIk kl'eImz Vv '0mnIs,i@nt b'i:;INz _:_:O@ pr'i:si@nt pr'0fIts

It pronounces everything correctly until it gets to "omniscient" and
"prescient", and maybe its pronunciations for those are correct, but
at least they're not the pronunciations I would use.

Under policy "A", those words are not common enough to add to
`en_list`: they are rarer than the median entry, so adding them would
lower the median frequency of words in `en_list` unless you removed a
less common word to compensate.

Under policies "C+" and "C-", not only "omniscient" and "prescient"
qualify, but so do "neuroscience", "geoscience", and
"neuroscientists", which eSpeak currently mispronounces.

(Including all the exceptions that are as rare as "prescient" might
quadruple the size of `en_list`, and perhaps `en_dict` as a result, if
arbitrary spellings were as common among rare words as they are among
common words.  Think of that as an upper bound.  Including all the
exceptions as rare as "neuroscientists" might multiply its size by
seven.  This is the downside of policy "C-", but it does not happen
with policy "C+".  On the other hand, under policy "C+", even
"prescient" might not survive long after being added.)

Recommendation
--------------

There is a better solution than adding a bunch of one-word special
cases to `en_list`.

Probably in this case the solution is to change the special case for
"conscience" to a special case for "conscien..."  and change the
"scien..." rule to a "...scien..." rule; that covers all the words
except for "omniscien..."  and "prescien...".  Covering those two
takes only two more rules in `en_rules`, if it's considered
worthwhile; but "conscience" is ten times as common as both of those
together, "con" three times as common, but "barbeque" 18 times less
common.
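
Those ratios can be reproduced from the counts in the listings above. The sketch assumes the "omniscien..." and "prescien..." total covers the four words below and leaves out "presciently" and "prescientific" (5 occurrences each); that is a guess at what was counted, chosen because it matches the "18 times" figure.

```python
# Occurrence counts copied from the British National Corpus listings above.
omni_pre = {
    'omniscient': 53, 'prescient': 36,
    'prescience': 28, 'omniscience': 27,
}
conscience, con, barbeque = 1355, 455, 8

total = sum(omni_pre.values())
print("omniscien.../prescien... together:", total)
print("conscience is %.1fx as common" % (conscience / total))
print("con is %.1fx as common" % (con / total))
print("barbeque is %.0fx less common" % (total / barbeque))
```

On these counts, two extra rules in `en_rules` would cover words used about as often as a third of the occurrences of "con", the median entry in the sample.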

Alternatives
------------

I think there is a need for a larger `en_list` and `en_rules` to be
available, even if they aren't part of the standard distribution.
eSpeak's current footprint for a single language is about 160KB for
the executable and 80KB for the dictionary.  But it would be useful in
many cases even if its dictionary were 800KB (as perhaps it would be
with the Festival lexicon) or 8MB.

There is also a need for a better user interface for making changes
to the dictionary, and especially `en_rules`: currently it's hard to
know what words you're changing the pronunciation of when you change
`en_rules`, and you have to master a phonological orthography system
to make any contribution at all.  And then there's no `git`-like
infrastructure for sharing your changes, and even learning `git` is a
pretty big barrier to contributions.

If, instead, you could twist a knob to jog back to the last
mispronounced word, then hold down a button and say its correct
pronunciation, the barrier to contributions would be much lower.  You
would need a reasonable phonological analysis system (like in a
speech-to-text system) to turn the spoken word into the string of
phonemes.  Then, if you could share your accumulated corrections with
all other users of the software with the push of a button, the process
of coming up with the tens of thousands of special cases would be a
lot quicker.

Sun, 08 Jun 2008

[I tried to send this before, but I guess I need to authorize
kragen at canonical.org to send to the list...]

On Sat, 7 Jun 2008 17:42:53 +0200, Aristotle Pagaltzis wrote:
> [Note to markdown-discuss readers: for context see
> <http://lists.canonical.org/pipermail/kragen-hacks/2008-June/000488.html>]
> 
> * Kragen Javier Sitaker <kragen at pobox.com> [2008-06-07 09:40]:
> > Stylesheeting comes naturally. I just put a `<style>` element
> > at the top with a few lines inside of it to format nicely.
> 
> Note that Markdown ends up wrapping `<link>` and `<style>`
> in `<p>` tags, arguably erroneously.

What, is Markdown supposed to know that those elements are neither
block-level nor span-level markup?

> Of course, neither tag has any business being in the HTML body;
> they should both be in the head. Since you’re loading
> BeautifulSoup anyway, you probably want to include that as fix-up
> step in your postprocessing.

Yeah, I've been planning to do that, but didn't get around to doing it
before sending it out.  (I also have this theory that I'll be a
happier person if I don't spend too much time on inventing private
markup languages, and that makes me perhaps unreasonably reluctant to
add extra postprocessing steps, even when the alternative is
ill-formed HTML.)

Kragen

* Kragen Javier Sitaker <kragen at canonical.org> [2008-06-08 18:00]:
> On Sat, 7 Jun 2008 17:42:53 +0200, Aristotle Pagaltzis wrote:
> > * Kragen Javier Sitaker <kragen at pobox.com> [2008-06-07 09:40]:
> > > Stylesheeting comes naturally. I just put a `<style>`
> > > element at the top with a few lines inside of it to format
> > > nicely.
> > 
> > Note that Markdown ends up wrapping `<link>` and `<style>` in
> > `<p>` tags, arguably erroneously.
> 
> What, is Markdown supposed to know that those elements are
> neither block-level nor span-level markup?

In fact it is – or to phrase the statement more appropriately, it
is supposed to know that wrapping such elements in `<p>` tags is
not a useful thing to do. It already avoids wrapping block-level
tags in paragraphs; there is no reason not to extend that list to
also contain all the elements that may appear in the head of an
HTML document, even though they are nominally invalid in the body.

> > Of course, neither tag has any business being in the HTML
> > body; they should both be in the head. Since you’re loading
> > BeautifulSoup anyway, you probably want to include that as
> > fix-up step in your postprocessing.
> 
> Yeah, I've been planning to do that, but didn't get around to
> doing it before sending it out.  (I also have this theory that
> I'll be a happier person if I don't spend too much time on
> inventing private markup languages, and that makes me perhaps
> unreasonably reluctant to add extra postprocessing steps, even
> when the alternative is ill-formed HTML.)

I agree with that theory, which is why I use Markdown despite the
minor (and a few less minor) things I dislike about it. Attaching
some light postprocessing to Markdown to add support for small
missing bits seems like a much better approach to achieve both
satisfaction and an absence of unhappiness.

I try to write code only when I can’t help it…

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Sat, 07 Jun 2008

On Sat, Jun 7, 2008 at 11:42 AM, Aristotle Pagaltzis <pagaltzis at gmx.de> wrote:
> [Note to markdown-discuss readers: for context see
> <http://lists.canonical.org/pipermail/kragen-hacks/2008-June/000488.html>]
>
> * Kragen Javier Sitaker <kragen at pobox.com> [2008-06-07 09:40]:
>> Stylesheeting comes naturally. I just put a `<style>` element
>> at the top with a few lines inside of it to format nicely.
>
> Note that Markdown ends up wrapping `<link>` and `<style>`
> in `<p>` tags, arguably erroneously. Weirdly, it looks like
> Python-Markdown should avoid that mistake:

Well, first of all, he's using an old version of Python-Markdown. The
first line of his `render` function gives it away (due to the change
in 1.7 to all-unicode -- you generally don't pass unicode text to
`str()`):

        body = str(markdown.Markdown(text))

Just use the wrapper (all lowercase in any version):

        body = markdown.markdown(text)

IIRC, there was a bugfix in 1.7 that also addressed the raw HTML
wrapped in `<p>` tags thing. So, upgrade to 1.7 and that problem should
go away.



-- 
----
Waylan Limberg
waylan at gmail.com

[Note to markdown-discuss readers: for context see
<http://lists.canonical.org/pipermail/kragen-hacks/2008-June/000488.html>]

* Kragen Javier Sitaker <kragen at pobox.com> [2008-06-07 09:40]:
> Stylesheeting comes naturally. I just put a `<style>` element
> at the top with a few lines inside of it to format nicely.

Note that Markdown ends up wrapping `<link>` and `<style>`
in `<p>` tags, arguably erroneously. Weirdly, it looks like
Python-Markdown should avoid that mistake:

* <http://babelmark.bobtfish.net/?markdown=%3Cstyle%3Efoo+%7B%7D%3C%2Fstyle%3E>
* <http://babelmark.bobtfish.net/?markdown=%3Clink+%2F%3E>

But at the top of <http://canonical.org/~kragen/crywrap.html> the
problem shows up anyway.

Of course, neither tag has any business being in the HTML body;
they should both be in the head. Since you’re loading
BeautifulSoup anyway, you probably want to include that as fix-up
step in your postprocessing.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Since working on this new project where we're using
<http://jottit.com/> for our to-do list, I've become enamored of
Markdown.  So I wrote this script to allow me to write documents
originally in Markdown and then generate HTML versions.

This works particularly well with Emacs `longlines-mode`.  I wrote the
first draft of my new company's "Acta Constitutiva" (i.e. charter and
bylaws) in it.  With CSS and with `M-x recompile` bound to the F5 key,
and `compile-command` set to `(cd ~/distributed-expertise;
~/devel/mkhtml.py bylaws; iceweasel bylaws.html)`, it was a pretty
reasonable word-processing experience, preferable to OpenOffice Write
for the following reasons:

- On my 384MB 700MHz laptop, OpenOffice is painfully slow; Emacs
  screams.
- My Emacs has an input method that handles Spanish reasonably;
  OpenOffice might, but I haven't been able to find it.
- I could edit in HTML and CSS in the cases where I wanted it, and
  pretend I was just editing a normal text file the rest of the time,
  with all of the normal Emacs amenities.  Except with
  word-processor-style word wrap instead of this M-q crap.
- Stylesheeting comes naturally.  I just put a `<style>` element at
  the top with a few lines inside of it to format nicely.
- I can see more of the document at a time in Emacs.

Like everything else posted to kragen-hacks without any notice to the
contrary, this program is in the public domain; I abandon any
copyright in it.

    #!/usr/bin/python
    """Turn Markdown documents into HTML documents.

    Depends on python-markdown and Beautiful Soup.

    Markdown normally generates HTML document content; this generates HTML
    documents instead.

    """
    import markdown, BeautifulSoup, sys, os, os.path

    def render(text):
        "Given Markdown input as a string, produce an HTML document as a string."
        body = str(markdown.Markdown(text))
        soup = BeautifulSoup.BeautifulSoup(body)

        headers = soup('h1')
        if len(headers) > 0:
            title = headers[0].renderContents()
        else:
            title = 'Lame document with no top-level header'

        return '''<html><head><title>%s</title>
        <meta http-equiv="content-type" content="text/html; charset=utf-8" />
        </head>
        <body>%s</body></html>''' % (title, body)

    def process(infile):
        "Given a filename of Markdown input, create an HTML file as output."
        outfile = infile + '.html'

        if os.path.exists(outfile) and \
               os.stat(outfile).st_mtime > os.stat(infile).st_mtime:
            print "`%s` is newer than `%s`, skipping  " % (outfile, infile)
            return

        outfiletmp = outfile + '.tmp'
        fo = file(outfiletmp, 'w')
        fo.write(render(file(infile).read()))
        fo.close()

        os.rename(outfiletmp, outfile)  # atomic replace; won't work on Win32
        print "rendered `%s` to `%s`  " % (infile, outfile)

    def main(args):
        filenames = args[1:]
        if filenames:
            for filename in filenames: process(filename)
            return 0
        else:
            print ("usage: `%s foo bar baz`; implicitly writes to `foo.html`, etc."
                   % args[0])
            return 1

    if __name__ == '__main__':
        sys.exit(main(sys.argv))
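
The script leaves any `<style>` or `<link>` element in the body, possibly wrapped in `<p>` tags by Markdown. A fix-up step that moves them into the head might look something like the following sketch. It is not part of the script above: the function name `hoist_head_elements` is made up for illustration, and it uses regular expressions rather than Beautiful Soup only to stay self-contained, so it will not cope with everything a real HTML parser would (comments, CDATA, a `>` inside an attribute value).

```python
import re

# Elements that belong in <head> but may turn up in Markdown output.
_ELEMENT = (r'(?:<style\b[^>]*>(?:(?!</style>).)*</style>'
            r'|<link\b[^>]*>)')
_WRAPPED = re.compile(r'<p>\s*(%s)\s*</p>' % _ELEMENT,
                      re.DOTALL | re.IGNORECASE)
_BARE = re.compile(_ELEMENT, re.DOTALL | re.IGNORECASE)


def hoist_head_elements(body):
    """Return (head_elements, cleaned_body) for a rendered HTML fragment.

    head_elements is a list of the <style> and <link> elements found;
    cleaned_body is the fragment with them (and any <p> wrapper that
    contained nothing else) removed.
    """
    # First drop a <p>...</p> wrapper that contains only one such element.
    unwrapped = _WRAPPED.sub(lambda match: match.group(1), body)
    # Then pull the elements themselves out of the body.
    head_elements = _BARE.findall(unwrapped)
    cleaned_body = _BARE.sub('', unwrapped)
    return head_elements, cleaned_body


if __name__ == '__main__':
    sample = ('<p><style>h1 { color: red }</style></p>\n'
              '<h1>Title</h1>\n<p>Text.</p>')
    extras, cleaned = hoist_head_elements(sample)
    print('<html><head>%s</head>\n<body>%s</body></html>'
          % ('\n'.join(extras), cleaned))
```

In `render`, the returned list would be joined into the `<head>` next to the `<title>`, and the cleaned body would take the place of `body`.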

Thu, 05 Jun 2008

I'm running QEMU with kqemu on my old 700MHz laptop.

User-mode stuff is slowed down only slightly.  This command line:

    time for x in $(seq 10000); do :; :; :; :; done

takes 1.17, 1.19, 1.20, and 1.22 user seconds in emulation, versus
1.13, 1.13, 1.14, and 1.14 user seconds outside QEMU.

However, it takes about 100ms of system time in place of about 10ms.
(The `-kernel-kqemu` flag may solve this; haven't measured.)
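
To put numbers on "slowed down only slightly", the measurements above can be averaged and compared. This is only arithmetic on the figures quoted here (four runs each, plus the approximate 100ms and 10ms system times); the variable names are made up for the sketch.

```python
# User-time measurements quoted above, in seconds.
emulated = [1.17, 1.19, 1.20, 1.22]
native = [1.13, 1.13, 1.14, 1.14]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Fractional slowdown of user-mode code under QEMU with kqemu.
user_slowdown = mean(emulated) / mean(native) - 1

# System time: roughly 100ms emulated versus roughly 10ms native.
system_ratio = 100 / 10

if __name__ == "__main__":
    print("user time slowdown: %.1f%%" % (100 * user_slowdown))
    print("system time ratio: %.0fx" % system_ratio)
```

That works out to roughly a 5% slowdown in user time against a tenfold increase in system time.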

I had some kind of keyboard problem when I ran QEMU 0.8.2-4etch1 with
`-snapshot`.  Like, the keyboard just didn't work.  That problem went
away when I built QEMU 0.9.1 from source and started using that, but I
still can't use `-snapshot` and `-loadvm` together.

Networking: `tap`
-----------------

This was a bad idea (for me).

By default, QEMU uses `user` networking, which proxies network
connections through normal sockets, like `slipknot` or `slirp` or
`term`.  (In fact, it uses `slirp`.)  I thought this didn't give me a
way to talk to it over the network (for example, if I'm running a web
server on it).

So I thought `-net tap` could help with this, but it has some
drawbacks.  It requires running QEMU as root, and then the network
interface on the emulated machine needs to be configured statically,
e.g. in `/etc/network/interfaces`, since `-net tap` doesn't provide
DHCP by default.  And then you have to set up IP masquerading, more or
less as follows:

    qemu -net nic -net tap,script=ifup "$image"

In file `ifup`:

    set -e
    /sbin/ifconfig "$1" 172.20.0.1
    echo 1 > /proc/sys/net/ipv4/ip_forward
    /sbin/iptables -t nat -A POSTROUTING --source 172.20.0.0/24 -j MASQUERADE

This does actually work, but you have to configure the network stuff
inside of QEMU: IP address, netmask, default gateway, and worst of
all, DNS server.  And I think it might allow other people on your LAN
to masquerade through you.
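
The static settings the guest needs follow from the host side of the `tap` setup. Here is a sketch of that derivation using Python's `ipaddress` module, assuming the /24 from the `iptables` rule above is the subnet in use and that the guest simply takes the first address the host isn't using (both assumptions; the function name is made up too). The DNS server can't be derived this way and still has to be filled in by hand.

```python
import ipaddress


def guest_static_settings(host_tap_address, prefix_length=24):
    """Suggest guest network settings given the host's tap address.

    The gateway is the host's tap address; the guest gets the first
    usable address in the subnet that the host is not already using.
    """
    interface = ipaddress.ip_interface(
        "%s/%d" % (host_tap_address, prefix_length))
    network = interface.network
    guest = next(host for host in network.hosts() if host != interface.ip)
    return {
        "address": str(guest),
        "netmask": str(network.netmask),
        "gateway": str(interface.ip),
        "network": str(network),
    }


if __name__ == "__main__":
    # 172.20.0.1 is the address the ifup script gives the tap interface.
    for key, value in sorted(guest_static_settings("172.20.0.1").items()):
        print(key, value)
```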

What would be ideal would be bridging the virtual interface to my real
Ethernet interface, but I never got around to doing this.

Networking: `-redir`
--------------------

It turns out there's an easier way.  I can use the default `user`
networking, and if I have a web server on the emulated host on port
8080, I can say

    qemu -redir tcp:8000::8080 "$image"

and connect my web browser to <http://localhost:8000/>.

This works beautifully.  The one downside I've found is that if you're
using `qemu -loadvm`, the inner virtual machine has to re-request DHCP
before the redirection works.
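
A quick way to tell from the host whether the redirection is answering yet (for example, after a `-loadvm`) is to make a real HTTP request rather than just opening the port. This sketch assumes the port numbers from the example above; `http_status` is a made-up helper name. A full request is used because a bare TCP connect to the redirected port may succeed even when the guest isn't answering (that is my understanding of user-mode networking, not something measured here).

```python
import http.client


def http_status(host, port, path="/", timeout=5.0):
    """Return the HTTP status code from host:port, or None on failure."""
    connection = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        connection.request("GET", path)
        return connection.getresponse().status
    except (OSError, http.client.HTTPException):
        # Refused, timed out, reset, or not speaking HTTP.
        return None
    finally:
        connection.close()


if __name__ == "__main__":
    status = http_status("localhost", 8000)
    if status is None:
        print("no answer on port 8000; the guest may need to redo DHCP")
    else:
        print("guest web server answered with status", status)
```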

Startup: `-loadvm`
------------------

Bootup takes an annoyingly long time.  But, if you don't regularly
have any permanent changes you want to save, you can use the `savevm`
command to save an image of the virtual machine state after a boot,
and then use `qemu -loadvm` to start QEMU in the already-booted state.

Tue, 03 Jun 2008

Oops,

* Aristotle Pagaltzis <pagaltzis at gmx.de> [2008-06-02 18:00]:
> you really want to check out [`socat`][]

[socat]: http://www.dest-unreach.org/socat/

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Mon, 02 Jun 2008

Hi Kragen,

* Kragen Javier Sitaker <kragen at canonical.org> [2008-06-02 09:40]:
> OpenSSL has the `openssl s_client` command, which is like an
> SSL version of `netcat`, and also `openssl s_server`.  These
> should be very handy for troubleshooting SSL stuff in general.

you really want to check out [`socat`][], which does everything
that `netcat`, `s_client` and a bunch of other tools do, plus
more besides. My favourite feature is that it can use not only
raw stdin as a source, but also a readline prompt, so you can
chatter at your mail- or webserver with all of the editing
conveniences you enjoy in shell.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Hi Kragen,

* Kragen Javier Sitaker <kragen at canonical.org> [2008-06-02 09:40]:
> (Available in HTML at <http://canonical.org/~kragen/crywrap.html>.)

you know, now that HTML versions of your -tol posts are regularly
featured, it wouldn’t be very hard to make a script that takes
them and sticks them into a feed… ;-)

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

(Available in HTML at <http://canonical.org/~kragen/crywrap.html>.)

So, in March, we upgraded our machine from Sarge to Etch.  We had been
using `sslwrap` in Sarge, but `sslwrap` doesn't exist in Etch.
According to Jonathan McDowell, the guy who used to maintain the
Debian `sslwrap` package:

    sslwrap (2.0.6-18) unstable; urgency=low
    
     * Users might like to consider switching away from sslwrap to
       crywrap or investigating whether more recent versions of the
       services they're sslwrapping are themselves now ssl enabled. It
       is envisaged that at some point in the future I will request
       removal of sslwrap from the archive, though I hope to
       investigate the possibility of a smooth upgrade path to crywrap
       before that happens. sslwrap is effectively dead upstream and I
       think it's probably better to consider the existing
       alternatives that can perform the same function than continue
       to work on sslwrap long term.
    
     -- Jonathan McDowell <noodles at earth.li> Sat, 13 Aug 2005 13:01:06
        +0100

(from <http://ubuntu2.cica.es/ubuntu/ubuntu/pool/universe/s/sslwrap/sslwrap_2.0.6-18.diff.gz>)

See also <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=374521>,
where the maintainer requested its removal from Debian.

Why I Tried, Then Gave up on `CryWrap`
------------------------------------

`apt-cache search sslwrap` found only `crywrap`, and `apt-cache show
crywrap` said:

> `CryWrap` is intended to be a drop-in replacement for `sslwrap`.

This is more or less a blatant lie.  `CryWrap`'s command-line options
have nothing in common with `sslwrap`'s, and `sslwrap` is written to
run from `inetd` --- for example, it reports its errors through
`syslogd`, not by printing them to standard error.  `CryWrap` is not,
even though it has an `--inetd` option.

Here are the problems I encountered trying to use `CryWrap`:

1. `CryWrap` doesn't support `sslwrap` command-line options.
2. `CryWrap` reports its errors to `stderr`.
3. When I did get `CryWrap` to work, it reliably took about 110 seconds
   to negotiate an SSL connection, which is longer than Thunderbird is
   willing to wait.
4. `CryWrap`'s `sslwrap` wrapper doesn't support `inetd`.
5. `CryWrap`'s documented `-v` flag doesn't work as documented.

I wasted an hour trying to get `CryWrap` to work, and eventually gave
up and installed `stunnel4` instead.  Here are some details, in case
you find yourself in a similar situation:

### `CryWrap` doesn't support `sslwrap` command-line options. ###

The line in our /etc/inetd.conf for `sslwrap` looked like this, all on
one line:

    pop3s stream  tcp nowait root /usr/sbin/tcpd 
        /usr/sbin/sslwrap -cert /etc/sslwrap/server.pem 
                          -addr 127.0.0.1 
                          -port 110

I `apt-get install`ed `crywrap`, changed `sslwrap` to `crywrap`, and
hoped for the best.  Initially, that failed because `tcpd` was specially
configured to allow connections to `sslwrap` from weird places, in
`/etc/hosts.deny`:

    ALL EXCEPT sslwrap: PARANOID EXCEPT <censored>

Our Argentine ISP, TeleCentro, is so incompetent that our reverse DNS
maps our IP address to a name that doesn't exist.  So `tcp_wrappers`'s
`PARANOID` rule won't allow us to connect to tcp-wrapped services.  So
I changed that to say

    ALL EXCEPT crywrap: PARANOID EXCEPT <censored>

Then I ran into the problem that we had removed the `/etc/sslwrap`
directory and the `server.pem` file inside it that contained the
server's private key.  After a little bit of digging, I found out
where to get the private key file, stuck it in
`/etc/crywrap/server.pem`, and put the following completely wrong line
in `/etc/inetd.conf`:

    pop3s stream  tcp nowait root /usr/sbin/tcpd 
        /usr/sbin/crywrap -cert /etc/crywrap/server.pem 
                          -addr 127.0.0.1 
                          -port 110

You see, at this point, I still believed the package description that
claimed that `CryWrap` was "a drop-in replacement".  I got a log line
(apparently from `tcpd`) that said the connection had been made:

    Mar 29 19:51:03 panacea crywrap[27458]: 
        connect from 190.55.55.32 (190.55.55.32)

But the `crywrap` process had died, and there were no error messages in
any of the `/var/log` files explaining why.  This is because of problem
#2, "`CryWrap` reports its errors to `stderr`," which I explain below.
Upon consulting the man page and debugging from the error message for
a while, I ended up with the following line in `/etc/inetd.conf`
instead:

    pop3s stream tcp nowait root /usr/sbin/tcpd
        /home/kragen/crywrap -d 127.0.0.1/110 -i

`/etc/crywrap/server.pem` is the default location for `CryWrap` to look
for a server certificate, so I omitted it from the command line.

### `CryWrap` reports its errors to `stderr`. ###

In order to find out what was wrong, I temporarily ran
`/home/kragen/crywrap` instead of `/usr/sbin/crywrap` from `inetd`.
`/home/kragen/crywrap` is this script:

    #!/bin/sh
    /usr/bin/strace -s4096 -o /tmp/crywrap.strace /usr/sbin/crywrap "$@"

And it turned out that `CryWrap` was writing its error messages to
`stderr`, file descriptor 2, instead of to a log file.  `stderr` in a
process run from `inetd` is actually connected to the socket talking
to the client, so writing error messages to it is almost certain to
violate the protocol expected by the client.  Here's one sample error
message (a result of problem #4, "`CryWrap`'s `sslwrap` wrapper
doesn't support `inetd`", below) from strace's output (wrapped for
readability):

    write(2, "crywrap", 7)                  = 7
    write(2, ":", 1)                        = 1
    write(2, " ", 1)                        = 1
    write(2, "Could not resolve address: `/\'", 30) = 30
    write(2, "\n", 1)                       = 1
    write(2, "Try `crywrap --help\' or `crywrap --usage\' for 
        more information.\n", 64) = 64
    exit_group(64)                          = ?

This was at the very end of the file.

Now, this would not be such a heinous sin in a program that was
intended to speak, say, SMTP.  If a fatal error message gets sent to
an SMTP client, it's likely to end up somewhere that a human being can
see it and diagnose the problem.  But SSL is a different matter.  SSL
connections are normally full of random toxic binary data, so almost
no SSL-speaking programs will dump out that data on a human when
there's a connection failure.  So the only way I was able to find
these error messages was by running the program under `strace(1)`.
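
The effect is easy to reproduce without `inetd`. The sketch below starts a child process with its standard input, output, and error all attached to one end of a socket pair, which is roughly what `inetd` does with the client socket, and shows that whatever the child writes to `stderr` arrives at the "client" end. The failing child here is a stand-in written for the example, not `CryWrap`; it needs a POSIX system.

```python
import socket
import subprocess
import sys


def run_inetd_style(argv):
    """Run argv with fds 0, 1 and 2 all attached to one socket.

    Returns (exit_status, bytes_received_by_the_client_end).
    """
    client_end, server_end = socket.socketpair()
    fd = server_end.fileno()
    process = subprocess.Popen(argv, stdin=fd, stdout=fd, stderr=fd)
    server_end.close()  # the child holds its own copies now

    received = b""
    while True:
        chunk = client_end.recv(4096)
        if not chunk:  # child exited and closed its end
            break
        received += chunk
    client_end.close()
    return process.wait(), received


# Stand-in for a daemon that reports a fatal error on stderr and exits.
FAILING_DAEMON = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('daemon: fatal error\\n'); sys.exit(64)",
]

if __name__ == "__main__":
    status, data = run_inetd_style(FAILING_DAEMON)
    # The error text shows up in the client's data stream.
    print(status, repr(data))
```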

### `CryWrap` took about 110 seconds to negotiate an SSL connection. ###

Once I got `CryWrap` to run, my wife Beatrice was still reporting
failures getting her mail in Thunderbird.  `strace` showed that
`CryWrap` was running and receiving data (`less /tmp/crywrap.strace`
and then typing `>F` was very helpful to watch this in real time), but
it was receiving it very slowly, a few bytes every few seconds.

At Paul Visscher's suggestion, I tested the connection myself with the
OpenSSL package's `openssl` command:

    openssl s_client -connect panacea.canonical.org:pop3s

This did eventually connect and allow me to speak POP (simulated
copy-and-paste here may contain errors):

    ...
        Timeout   : 300 (sec)
        Verify return code: 21 (unable to verify the first certificate)
    ---
    +OK
    USER imaptest
    +OK
    PASS <censored>
    +OK
    QUIT
    DONE

However, it took about a minute and 51 seconds.  This is apparently
more than Thunderbird's timeout.  I don't know enough about SSL to
know why this might be.  `CryWrap` reported it with these `syslog`
messages (wrapped and trimmed for readability):

    crywrap[27830]: Accepted connection from 190.55.55.32 on 0 to 
        127.0.0.1/110
    crywrap[27830]: Handshake failed: A TLS packet with unexpected 
        length was received.

I never did figure out why this happened, and so I gave up on `CryWrap`
and switched to stunnel4 (see below).
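
For comparing against a client's timeout, the handshake time can also be measured directly. The following is a sketch using Python's standard `ssl` module rather than `openssl s_client`; the function name is made up, and certificate verification is switched off because the certificate above didn't verify anyway, which is only appropriate for troubleshooting.

```python
import socket
import ssl
import time


def time_tls_handshake(host, port, timeout=180.0):
    """Connect to host:port, do a TLS handshake, and time it.

    Returns (seconds_elapsed, negotiated_protocol_version).
    Raises OSError (including ssl.SSLError) if the connection or the
    handshake fails or times out.
    """
    context = ssl.create_default_context()
    # Troubleshooting only: accept certificates that don't verify.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # wrap_socket performs the handshake before returning.
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return time.monotonic() - start, tls.version()


if __name__ == "__main__":
    # 995 is the standard pop3s port.
    elapsed, version = time_tls_handshake("panacea.canonical.org", 995)
    print("handshake took %.1f seconds using %s" % (elapsed, version))
```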

### `CryWrap`'s `sslwrap` wrapper doesn't support `inetd` ###

There is a shell script at `/usr/share/crywrap/sslwrap` that is intended
to make `CryWrap` act like `sslwrap`, but it doesn't consider the case of
running from `inetd` (the `-i` or `--inetd` flag).  Because it's a
badly-written shell script, it doesn't notice that its "listen port"
parameter is missing; it merely tries to invoke `CryWrap` with "-l /".
(`CryWrap` uses a slash to separate IP address from port, instead of the
traditional colon; in this case, both the IP address and the port are
missing, leaving only the lonesome "/", like a girl who's been stood
up on a date.)

`CryWrap` reports this by sending the helpful message:

    crywrap: Could not resolve address: `/'

to the would-be SSL client.  I extracted it from an `strace` output
file in `/tmp`, except that I had to use `strace -ff` to follow the
children of the `/usr/share/crywrap/sslwrap` script.  (I guess I could
have just redirected `stderr` to a file instead of using `strace`.)

### `CryWrap`'s documented `-v` flag doesn't work as documented. ###

`CryWrap`'s man page documents a `-v` flag.  `-v 0` is documented to
turn off client certificate validation, although having it turned off
is documented to be the default.  We thought that perhaps the default
was actually something other than what it was documented to be,
because on the successful `openssl s_client` connections (see above
under #3), we were getting this message:

    crywrap[28190]: Error getting certificate from client: The peer
        did not send any certificate.

And it seemed plausible that this might explain the slowness (#3).  So
I tried adding `-v 0` to the command line, because the man page says:

> `--verify` (`-v`) [LEVEL]
> 
> > Set the level of client certificate verification. Level
> > one simply logs the result, level two and above abort if
> > the certificate could not be verified.  
> > Default is 0.

If you actually try running crywrap with `-v 0`, you get this error
message:

    kragen at panacea:~$ /usr/sbin/crywrap -l /3802 -d /110 -v 0
    crywrap: Too many arguments
    Try `crywrap --help' or `crywrap --usage' for more information.

Except that I didn't originally get the error message at the command
line; I had to dig it out of `strace` output in `/tmp` after editing
`/etc/inetd.conf` and restarting `inetd`.  It turns out that `-v0` is
the supported syntax, despite what the man page says, and in violation
of the usual Unix conventions.  No space is permitted.
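
For comparison, here is the usual convention as implemented by Python's standard `getopt` module, where an option declared as taking an argument accepts it either attached or as the next word. The `[LEVEL]` in the man page excerpt suggests the level is an optional argument, and GNU-style parsers typically accept an optional argument only when it is attached to the flag, which would explain the behavior; that explanation is a guess, not something confirmed above. The function name is made up for the sketch.

```python
import getopt


def verify_level(argv):
    """Parse a conventional -v LEVEL option; the default level is 0."""
    # "v:" declares -v as an option that requires an argument.
    options, _remaining = getopt.getopt(argv, "v:")
    level = 0
    for flag, value in options:
        if flag == "-v":
            level = int(value)
    return level


if __name__ == "__main__":
    # Both spellings mean the same thing under the usual convention.
    print(verify_level(["-v", "0"]), verify_level(["-v0"]))
```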

Success with `stunnel4`
-----------------------

I did this:

    $ sudo apt-get install stunnel4

Then, after skimming the `stunnel` man page, I stuck this in
`/etc/inetd.conf` (all on one line) in place of the `crywrap` line:

    pop3s stream tcp nowait root /usr/sbin/tcpd 
        /usr/bin/stunnel -p /etc/crywrap/server.pem -r 110

That worked.  Then I moved `/etc/crywrap/server.pem` to
`/etc/stunnel/server.pem` and all was good.  The total elapsed time
since giving up on `CryWrap` was just under eleven minutes.

Things I Learned
----------------

Or was reminded of.

0. It's easy to underestimate how much of a pain in the ass your
   software will be for other people.  Presumably `CryWrap`'s author
   wouldn't have had any of the above problems (except for #3, and he
   could have probably diagnosed that one).

1. If I write software and claim it's a "drop-in replacement" for
   something else, someone is going to be sad.  Or pissed off.
   Because I'll probably forget something.  (Although hopefully I'll
   do better than this!)

2. It's good to be careful about where error messages go.

3. I should try to make sure that my software handles errors
   (e.g. missing listen port) in a graceful fashion, i.e. by bombing
   out with an error ("listen port required") instead of proceeding to
   invoke something else with some broken default (in this case, the
   empty string) and relying on it to emit a useful error message
   (``crywrap: Could not resolve address: `/'``).  Generally it's pretty
   easy to make this mistake in shell scripts, but in this case the
   listen port was explicitly set to the empty string before
   command-line parsing, as a default, so the problem would have been
   the same regardless of language.

4. It takes as long to write stuff like this up as it does to
   experience it.

5. Violating established conventions is likely to cause some
   frustration; be sure you're doing it for a good reason.  By
   convention `-v 0` is equivalent to `-v0` when `-v` takes an
   argument; the violation of this convention made the software harder
   to use.

6. `stunnel` rocks and can do what `sslwrap` did.  `CryWrap` sucks and
   can't.

7. OpenSSL has the `openssl s_client` command, which is like an SSL
   version of `netcat`, and also `openssl s_server`.  These should be
   very handy for troubleshooting SSL stuff in general.

8. I'm not a great sysadmin, and I tend to be too persistent when I
   should give up and try something else a little sooner.
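
Lesson 3 can be sketched in a few lines: check for the missing parameter in the wrapper itself and stop with a specific message, instead of passing an empty value along. This is a hypothetical wrapper written for illustration, not the script that ships with `CryWrap`; only the `-l`, `-d`, and `-i` flags and the `address/port` syntax are taken from the commands shown above.

```python
def crywrap_command(dest_port, listen_port=None, dest_addr="127.0.0.1",
                    listen_addr="", inetd=False):
    """Build a crywrap command line, refusing to guess missing values."""
    if not dest_port:
        raise ValueError("destination port required")
    command = ["crywrap", "-d", "%s/%s" % (dest_addr, dest_port)]
    if inetd:
        # Under inetd the socket is already connected: no listen port.
        command.append("-i")
    elif not listen_port:
        # Fail here, with a message that names the actual problem,
        # rather than invoking crywrap with "-l /".
        raise ValueError("listen port required")
    else:
        command += ["-l", "%s/%s" % (listen_addr, listen_port)]
    return command


if __name__ == "__main__":
    print(crywrap_command(110, inetd=True))
    try:
        crywrap_command(110)
    except ValueError as error:
        print("error:", error)
```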

Credits
-------

Thanks to Gergely Nagy for writing `CryWrap`, Jonathan McDowell for
maintaining the `sslwrap` Debian package for so long, Rick Kaseguma for
writing `sslwrap` in the first place, Beatrice Murch for having the
patience to help me test the mail server after the upgrade, Paul
Visscher for helping me out with most of the above stuff and also
doing a bunch of the work of the Etch upgrade on our machine, and
Brett Smith and Jason Cook for doing most of the rest of that work.