Tue, 24 Jun 2008

I came here because of a death; my friend Eric died a couple of months
ago, and I came for his memorial service a week ago.  I've been spending
the time since then appreciating all the people who aren't dead yet.

Today was the day I had planned to fly back to Argentina, but
unfortunately a number of bureaucratic obstacles have risen up in my
path.  I could go back to Argentina now, but I would probably have
to return to the US to deal with them.  So my departure is delayed until
July 4th.

I have had a wonderful time visiting friends and family here.  Every day
I see people I love whom I hadn't seen since last year, and it is
wonderful.  

But my time has been fairly full.  I've been very lucky in that friends
and family have lent me a house, a laptop, a bicycle, and a cell phone
while I'm here; without these, this level of activity would be pretty
difficult.

Some notable recent days:

Saturday: I went to Bolinas to get the Magic Bus; we think selling it in
San Francisco will be easier than selling it in Bolinas.  We certainly
won't be able to sell it for the amount of money we've put into it (US$2200
of work late last year, US$800 or so when I rebuilt the engine, US$2000
to tow it across the country, etc. etc., plus the US$4500 that was its
price when we first got it.)  But maybe we can get some fraction of that
money back.

Monday: I made breakfast for one friend, lunch for another, visited the
California Department of State, went shopping in Chinatown, biked
several miles uphill, and traveled to Pleasanton on BART.

Last Tuesday: Said goodbye to my cousin who's lending me his house, met
a friend in Berkeley for breakfast, visited another friend to see her
lab and pick up the cell phone she was lending me, rode over to San
Mateo with the first friend, visited a company I used to work for, got a
phone card to call Argentina with, met a third friend for dinner in
Berkeley, went to a meeting of some friends in San Francisco to
incorporate a nonprofit (shaving with a dry razor as I walked down the
street to get there), picked up keys to a friend's apartment nearby, and
picked up groceries for breakfast the next morning as I walked back to
BART.

I've somehow managed to keep my expenses relatively reasonable while
doing this.  As of Saturday, my average since arriving in the US had
been US$14.14 per day, about 75% of which had been on public transit.  I
suspect it's gone up since then, largely because of the Magic Bus.
Already, though, that's the same as the rent on our apartment in Buenos
Aires.

Some time this week I will need to drive to Modesto and look through a
storage unit for bureaucratic reasons, which is generally an ordeal in
the summer.  I am hoping I can find an early-rising friend or two to
join me.

Aristotle:

Thank you for the explanation.  What you wrote makes sense.  The  
papers that I referenced [1] didn't analyze the consequences of data  
loss failures, only the mechanism (if any) that the filesystem code  
has for detecting and responding to failures.  Those papers seem to  
suggest that reiser3 is better than the others (but not perfect) at  
the goal of detecting more possible failures, handling them in a
fail-safe manner (thus trading off availability to gain correctness) and
handling a variety of failures in a consistent way.  However, if  
there is an error which isn't detected by the filesystem code (such  
as a silent mis-write), then I can see how the less redundant reiser3  
data structure is more brittle.
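
To make the distinction concrete, here is a minimal sketch of the
fail-stop idea applied to a silent mis-write.  It assumes a layer above
the filesystem that keeps its own digest next to each block; the record
format and the names store, read_fail_stop and CorruptBlockError are
invented for illustration and are not taken from reiser3, ext3 or Tahoe.

```python
import hashlib


class CorruptBlockError(Exception):
    """Raised instead of returning data that fails verification."""


def store(data: bytes) -> dict:
    # Hypothetical record format: the payload plus a SHA-256 digest of it.
    return {"data": data, "digest": hashlib.sha256(data).hexdigest()}


def read_fail_stop(record: dict) -> bytes:
    # Fail-stop: refuse to serve the block if it does not match its digest,
    # giving up availability in order to keep correctness.
    if hashlib.sha256(record["data"]).hexdigest() != record["digest"]:
        raise CorruptBlockError("digest mismatch; refusing to return data")
    return record["data"]


record = store(b"hello world")
clean_read = read_fail_stop(record)

# Simulate a silent mis-write: the payload changes and no I/O error is raised.
record["data"] = b"hellp world"
try:
    read_fail_stop(record)
    detected = False
except CorruptBlockError:
    detected = True

print("silent mis-write detected:", detected)
```

Without the stored digest, the second read would simply return the
altered bytes, which is the case the filesystem's own error handling
cannot see.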

Thanks again.

> And for my own proclivities, reiserX goes too far toward the
> performance end of the scale.
...
> Hence my general dislike of reiserX.


Here are some personal proclivities of my own:

I don't like ext3.  It seems to be engineered "ad-hoc".  The recent
revelations that (a) it turns write-barriers off by default, (b)
nobody knows how much performance delta we're talking about, and (c)
nobody knows how much safety delta we're talking about, are just the
latest details to make me think that ext3 is engineered primarily by
ad-hoc response to complaints (or by ad-hoc improvements).  I won't
go into more detail.

I like reiser3 in general for the reasons that I listed above.   
However, "in general" is probably not the right way to choose a local  
filesystem.  Rather, the specific use case probably makes all the  
difference.  For Tahoe LAFS [2] storage servers, reiser3 is probably  
a good choice because it runs on Linux, is fast, and packs small  
files for better space efficiency.  Tahoe does not rely on local  
filesystems for data correctness or longevity, so the chance of data  
corruption or loss isn't that important a criterion, but reiser3's
tendency (mentioned above) to fail loudly and fail-stop is probably  
better operationally than the alternative of grinding along quietly  
while losing or corrupting data or suffering reduced performance.   
Also the fact that other large data-farming operations like the  
Internet Archive, Mozy, and EMC Centera [4] have used reiser3  
extensively gives me confidence.  Finally, the fact that reiser3 is  
old and does not get tweaked or improved is reassuring -- the worst  
failures we're likely to encounter are new bugs or new filesystem  
"improvements" that we didn't understand.

I like ZFS, and I'm happy using it on (Free Software, Open Source)  
Solaris.  My web server, http://zooko.com, is running Nexenta [3]
which uses ZFS by default.

I like BTRFS, and furthermore I predict that it will be a huge
success in a few years because (as Andy Isaacson showed me) you can
upgrade your ext3 filesystem in place to BTRFS, and even revert it
again to the state that it was in before you upgraded it to BTRFS.  I
think the main reason that ext3 is the de facto standard nowadays is
that ext3 was so data-backwards-compatible with ext2, and since
BTRFS is highly data-backwards-compatible with ext3 (as well as
having many other great features, being architected by Chris Mason,
who was responsible for much of the good stuff in reiser3, and being
funded and supported by Oracle), it is sure to be a winner.

Regards,

Zooko

[1] http://allmydata.org/trac/tahoe/wiki/Bibliography#LocalFilesystems
[2] http://allmydata.org
[3] http://nexenta.org
[4] http://lkml.org/lkml/2008/5/18/260

* zooko <zooko at zooko.com> [2008-06-24 02:30]:
> On Jun 23, 2008, at 1:50 PM, Aristotle Pagaltzis wrote:
>> The problem with the Reiser family FSs is that they are
>> inherently brittle. Now that they have been sufficiently
>> debugged, they no longer lose data often, but if you have even
>> a small unrepairable corruption, it is still more likely that
>> you’ll lose half your disk instead of just a few files, as is
>> the extX family’s failure mode.
>
> How do you know this? It doesn't seem to be implied by any of
> the papers that I referenced, but nor is it contradicted by
> them. I would like more data.

Purely from my own reasoning.

In extX, inodes, directories and the alloc bitmap are all
separate, and two of them are randomly accessible linear data
structures. Actually in some important ways even directories are.
There are a few crucial bits of metadata about these data
structures that, if destroyed, would preclude you from finding
them at all (eg. the superblock and such), but those are not
written to during normal operation. If you lose a directory,
the inodes are still there so you lose the tree structure but
none of the contents; if the bitmap is affected, as long as you
notice the inconsistency you lose nothing (so it’s really just
a cache); if you lose inodes, only the files described by the
affected inodes are lost. It’s simply impossible to do much
non-localised damage because the metadata layout has such low
entropy.

Of course that’s also a big reason why it’s impossible to make
extX fast for operations involving a lot of metadata.

In contrast, reiserX mediates all metadata through a Btree. If
you lose any subtree, the entire information about that subtree
becomes unreachable. You can use a carving-type tool and some
heuristics to try to find the metadata after the fact and restore
it as well as possible, but your chances are still mediocre. This
is how reiserX gets its phenomenal speed, of course – every bit
of metadata read from the disk helps avoid having to read more
metadata. Entropy is very high. That’s also the reason for its
sky-high CPU cycle consumption.

But it does mean that it is inherently brittle, because you need
all of the participating metadata to get at any piece of data,
whereas in extX a lot of the participating metadata only serves
as middle men providing indirection.
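
The difference in failure modes can be sketched with a toy model. This
is an illustration of the argument only, not a model of the real extX
or reiserX on-disk formats: the flat list, the nested-dict tree and the
function names flat_survivors and tree_survivors are all invented for
the example.

```python
# Toy model only: not the real ext2/ext3 or reiserfs on-disk layouts.

def flat_survivors(inode_table, lost_slots):
    """Flat table: each slot describes one file independently of the others."""
    return {name for slot, name in enumerate(inode_table)
            if slot not in lost_slots}


def tree_survivors(node, lost_nodes):
    """Tree: a file is reachable only if every node on its path survives."""
    if node["id"] in lost_nodes:
        return set()  # everything below a lost node becomes unreachable
    found = set(node.get("files", []))
    for child in node.get("children", []):
        found |= tree_survivors(child, lost_nodes)
    return found


files = [f"file{i}" for i in range(8)]

# Flat layout: eight independent slots.
inode_table = list(files)

# Tree layout: root -> two internal nodes -> four leaves of two files each.
tree = {
    "id": "root",
    "children": [
        {"id": "A", "children": [
            {"id": "A1", "files": files[0:2]},
            {"id": "A2", "files": files[2:4]},
        ]},
        {"id": "B", "children": [
            {"id": "B1", "files": files[4:6]},
            {"id": "B2", "files": files[6:8]},
        ]},
    ],
}

# Lose exactly one piece of metadata in each layout.
flat_left = flat_survivors(inode_table, lost_slots={3})
tree_left = tree_survivors(tree, lost_nodes={"A"})

print(len(flat_left), "of", len(files), "files reachable (flat, one slot lost)")
print(len(tree_left), "of", len(files), "files reachable (tree, one node lost)")
```

In this toy, losing one slot of the flat table costs one file, while
losing one internal tree node costs every file beneath it. A real
recovery tool might still find some of the orphaned leaves by scanning,
which the toy does not attempt.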

This is an information-theoretically rooted tradeoff. It is
mathematically impossible to make a filesystem both extremely
robust and extremely fast, because those properties lie at
opposite ends of the redundancy scale.

And for my own proclivities, reiserX goes too far toward the
performance end of the scale. At the same time I don’t think
extX is the be-all end-all on its part of the scale; I think
it is entirely possible to achieve robustness at least close
to that of extX without having to accept nearly as limited
performance.

Hence my general dislike of reiserX.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>