Thu, 03 Jul 2008

* Kragen Javier Sitaker <kragen at canonical.org> [2008-07-03 09:40]:
> I agree that the code needs rewriting, but inspired by Scheme,
> I would do it without magic numbers, without nine separate
> variables for the struct options pointers, and with conditions
> and their consequents on the same line:

Your version isn’t tabular though. This is something I like about
both of the other rewrites better than yours. However, Spinellis
waffles on way too long to determine which array element to pick
and Steele commits the terrible sin of copypasting the x-related
conditional onto every one of the three lines.

Here’s how I’d fix both of these problems:

    struct options *locations[3][3] = {
         {upleft,  upper,  upright },
         {left,    normal, right   },
         {lowleft, lower,  lowright},
    };
    int h = x == last   ? 2 : !!x;
    int v = y == bottom ? 2 : !!y;
    op = &(locations[v][h])[w->orientation];

This makes use of magic numbers and exploits quirks of the C type
system relating to ints and boolean logic. Is that bad style?

Or is it idiomatic C?

In any case it’s short enough to grasp quickly and the central
intent of the code is communicated visually by a table. I believe
quite strongly that code driven by tabular declarative sections,
even if the imperative part is somewhat tricky or even messy, is
easier to understand than simpler but more abstractly written code.
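
For anyone who wants to compile the idea, here is a minimal
self-contained sketch. The struct options layout, the ORIENTATIONS
count, and the names band() and pick() are all made up for
illustration; only the 3x3 table and the 0/1/2 index trick come from
the snippet above.

```c
#include <assert.h>

/* Hypothetical stand-in for the real struct options. */
struct options { int id; };

/* Assumed: each region has one options entry per orientation. */
enum { ORIENTATIONS = 2 };

static struct options upleft[ORIENTATIONS], upper[ORIENTATIONS], upright[ORIENTATIONS];
static struct options left[ORIENTATIONS], normal[ORIENTATIONS], right[ORIENTATIONS];
static struct options lowleft[ORIENTATIONS], lower[ORIENTATIONS], lowright[ORIENTATIONS];

/* The declarative part: rows are top/middle/bottom, columns left/middle/right. */
static struct options *const locations[3][3] = {
    { upleft,  upper,  upright  },
    { left,    normal, right    },
    { lowleft, lower,  lowright },
};

/* Map a coordinate to 0 (at zero), 1 (interior) or 2 (at max).
   pos != 0 yields the same 0-or-1 value as !!pos. */
static int band(int pos, int max)
{
    return pos == max ? 2 : pos != 0;
}

/* Pick the options entry for cell (x, y) in a grid whose last column is
   `last` and whose last row is `bottom`. */
struct options *pick(int x, int y, int last, int bottom, int orientation)
{
    assert(orientation >= 0 && orientation < ORIENTATIONS);
    return &locations[band(y, bottom)][band(x, last)][orientation];
}
```

Calling pick(0, 0, 9, 9, 0) lands on upleft[0] and pick(4, 4, 9, 9, 0)
on normal[0]. Note that, as in the snippet above, a grid with
last == 0 sends x == 0 to the right-hand column.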

I do not see this as the be-all end-all version either, btw; if
there was a way to get the offset of a named member of a struct
in C, it would be possible to meld the advantages of your code
and mine. Think of the machine code that would be generated from
my code: there will be multiplications by sizeof(struct options)
in there. The machine code version of your code would do the job
directly.
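
On the offset question: standard C does provide the offsetof macro
in <stddef.h>. Whether it gives the melding meant here is not
certain, but a table of member offsets might look roughly like the
sketch below. The struct regions container, its member layout, and
the name region_at() are assumptions invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real struct options. */
struct options { int id; };

/* Hypothetical container with one named member per screen region. */
struct regions {
    struct options upleft,  upper,  upright;
    struct options left,    normal, right;
    struct options lowleft, lower,  lowright;
};

#define OFF(member) offsetof(struct regions, member)

/* The table stays visual, but holds byte offsets instead of pointers. */
static const size_t region_offset[3][3] = {
    { OFF(upleft),  OFF(upper),  OFF(upright)  },
    { OFF(left),    OFF(normal), OFF(right)    },
    { OFF(lowleft), OFF(lower),  OFF(lowright) },
};

/* Resolve row v, column h (each 0..2) to the named member of *r. */
struct options *region_at(struct regions *r, int v, int h)
{
    assert(v >= 0 && v < 3 && h >= 0 && h < 3);
    return (struct options *)((char *)r + region_offset[v][h]);
}
```

Here region_at(&r, 1, 1) is the same pointer as &r.normal, computed
by adding a constant from the table to the base address.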

-- 
*AUTOLOAD=*_;sub _{s/(.*)::(.*)/print$2,(",$\/"," ")[defined wantarray]/e;$1}
&Just->another->Perl->hack;
#Aristotle Pagaltzis // <http://plasmasturm.org/>

Tue, 24 Jun 2008

Aristotle:

Thank you for the explanation.  What you wrote makes sense.  The  
papers that I referenced [1] didn't analyze the consequences of data  
loss failures, only the mechanism (if any) that the filesystem code  
has for detecting and responding to failures.  Those papers seem to  
suggest that reiser3 is better than the others (but not perfect) at  
the goal of detecting more possible failures, handling them in a fail- 
safe manner (thus trading off availability to gain correctness) and  
handling a variety of failures in a consistent way.  However, if  
there is an error which isn't detected by the filesystem code (such  
as a silent mis-write), then I can see how the less redundant reiser3  
data structure is more brittle.

Thanks again.

> And for my own proclivities, reiserX goes too far toward the
> performance end of the scale.
...
> Hence my general dislike of reiserX.


Here are some personal proclivities of my own:

I don't like ext3.  It seems to be engineered "ad-hoc".  The recent  
revelations that (a) it turns write-barriers off by default, (b)  
nobody knows how much performance delta we're talking about, (c)  
nobody knows how much safety delta we're talking about, are just the  
latest details to make me think that ext3 is engineered primarily by  
ad-hoc response to complaints (or by ad-hoc improvements).  I won't  
go into more detail.

I like reiser3 in general for the reasons that I listed above.   
However, "in general" is probably not the right way to choose a local  
filesystem.  Rather, the specific use case probably makes all the  
difference.  For Tahoe LAFS [2] storage servers, reiser3 is probably  
a good choice because it runs on Linux, is fast, and packs small  
files for better space efficiency.  Tahoe does not rely on local  
filesystems for data correctness or longevity, so the chance of data  
corruption or loss isn't that important of a criterion, but reiser3's  
tendency (mentioned above) to fail loudly and fail-stop is probably  
better operationally than the alternative of grinding along quietly  
while losing or corrupting data or suffering reduced performance.   
Also the fact that other large data-farming operations like the  
Internet Archive, Mozy, and EMC Centera [4] have used reiser3  
extensively gives me confidence.  Finally, the fact that reiser3 is  
old and does not get tweaked or improved is reassuring -- the worst  
failures we'd otherwise be likely to encounter are new bugs or new  
filesystem "improvements" that we didn't understand.

I like ZFS, and I'm happy using it on (Free Software, Open Source)  
Solaris.  My web server, http://zooko.com is running Nexenta [3]  
which uses ZFS by default.

I like BTRFS, and furthermore I predict that it will be a huge  
success in a few years because (as Andy Isaacson showed me), you can  
upgrade your ext3 filesystem in place to BTRFS, and even revert it  
again to the state that it was in before you upgraded it to BTRFS.  I  
think the main reason that ext3 is the de facto standard nowadays is  
because ext3 was so data-backwards-compatible with ext2, and since  
BTRFS is highly data-backwards-compatible with ext3 (as well as  
having many other great features, as well as being architected by  
Chris Mason who was responsible for much of the good stuff in  
reiser3, as well as being funded and supported by Oracle), then it is  
sure to be a winner.

Regards,

Zooko

[1] http://allmydata.org/trac/tahoe/wiki/Bibliography#LocalFilesystems
[2] http://allmydata.org
[3] http://nexenta.org
[4] http://lkml.org/lkml/2008/5/18/260

* zooko <zooko at zooko.com> [2008-06-24 02:30]:
> On Jun 23, 2008, at 1:50 PM, Aristotle Pagaltzis wrote:
>> The problem with the Reiser family FSs is that they are
>> inherently brittle. Now that they have been sufficiently
>> debugged, they no longer lose data often, but if you have even
>> a small unrepairable corruption, it is still more likely that
>> you’ll lose half your disk instead of just a few files, as is
>> the extX family’s failure mode.
>
> How do you know this? It doesn't seem to be implied by any of
> the papers that I referenced, but nor is it contradicted by
> them. I would like more data.

Purely from my own reasoning.

In extX, inodes, directories and the alloc bitmap are all
separate, and two of them are randomly accessible linear data
structures. Actually in some important ways even directories are.
There are a few crucial bits of metadata about these data
structures that, if destroyed, would preclude you from finding
them at all (eg. the superblock and such), but those are not
written to during normal operation. If you lose a directory,
the inodes are still there so you lose the tree structure but
none of the contents; if the bitmap is affected, as long as you
notice the inconsistency you lose nothing (so it’s really just
a cache); if you lose inodes, only the files described by the
affected inodes are lost. It’s simply impossible to do much
non-localised damage because the metadata layout has such low
entropy.

Of course that’s also a big reason why it’s impossible to make
extX fast for operations involving a lot of metadata.

In contrast, reiserX mediates all metadata through a Btree. If
you lose any subtree, the entire information about that subtree
becomes unreachable. You can use a carving-type tool and some
heuristics to try to find the metadata after the fact and restore
it as well as possible, but your chances are still mediocre. This
is how reiserX gets its phenomenal speed, of course – every bit
of metadata read from the disk helps avoid having to read more
metadata. Entropy is very high. That’s also the reason for its
sky-high CPU cycle consumption.

But it does mean that it is inherently brittle, because you need
all of the participating metadata to get at any piece of data,
whereas in extX a lot of the participating metadata only serves
as middle men providing indirection.
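
To put rough numbers on that reasoning, here is a toy model, not a
description of either on-disk format. It assumes a flat inode table
where each block holds a fixed number of inodes, and a complete tree
with uniform fanout where a destroyed node makes everything beneath
it unreachable. The function names and parameters are invented, and
recovery tools that rescan the disk are ignored.

```c
#include <assert.h>

/* Flat table: destroying one block loses only the inodes stored in it. */
unsigned long flat_loss(unsigned long inodes_per_block)
{
    return inodes_per_block;
}

/* Complete tree: level 0 is the root, leaves sit at level == depth and
   hold items_per_leaf items each. Destroying a node at `level` cuts off
   every leaf below it (or just itself, if it is a leaf). */
unsigned long tree_loss(unsigned fanout, unsigned depth, unsigned level,
                        unsigned long items_per_leaf)
{
    assert(level <= depth);
    unsigned long leaves = 1;
    for (unsigned i = level; i < depth; i++)
        leaves *= fanout;   /* one more layer of children per level */
    return leaves * items_per_leaf;
}
```

With an assumed fanout of 16, depth 3 and 32 items per leaf, losing a
leaf costs 32 items, losing a node one level up costs 512, and losing
the root costs 131072, while flat_loss(32) stays at 32 no matter
which block goes.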

This is an information-theoretically rooted tradeoff. It is
mathematically impossible to make a filesystem both extremely
robust and extremely fast, because those properties lie at
opposite ends of the redundancy scale.

And for my own proclivities, reiserX goes too far toward the
performance end of the scale. At the same time I don’t think
extX is the be-all end-all on its part of the scale; I think
it is entirely possible to achieve robustness at least close
to that of extX without having to accept nearly as limited
performance.

Hence my general dislike of reiserX.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Mon, 23 Jun 2008

On Jun 23, 2008, at 1:50 PM, Aristotle Pagaltzis wrote:

> The problem with the Reiser family FSs is that they are
> inherently brittle. Now that they have been sufficiently
> debugged, they no longer lose data often, but if you have
> even a small unrepairable corruption, it is still more likely
> that you’ll lose half your disk instead of just a few files, as
> is the extX family’s failure mode.

How do you know this?  It doesn't seem to be implied by any of the  
papers that I referenced, but nor is it contradicted by them.  I  
would like more data.

Regards,

Zooko

* zooko <zooko at zooko.com> [2008-06-23 20:35]:
> I'm not sure what the use case we're talking about is,
> but for many use cases reiser3 is a good choice.

The problem with the Reiser family FSs is that they are
inherently brittle. Now that they have been sufficiently
debugged, they no longer lose data often, but if you have
even a small unrepairable corruption, it is still more likely
that you’ll lose half your disk instead of just a few files, as
is the extX family’s failure mode.

So even now I would not entrust either Reiser-family FS with data
that I care about. Hence my suggestion of putting the less huge-
tree-performance sensitive git store on another partition and
keeping only the working copy on reiserX.

The other thing about the Reiser-family FSs is that they eat
enormous amounts of CPU, to the point that they can *fully peg*
a CPU during I/O.

> Andrew Morton vaguely recalled something like 30% performance
> loss when turning on write barriers for ext3.

Yes, ext3 is slow. It is much faster than the Btree-based FSs in
a select few circumstances, but in general, it is slow or very
slow.

> Chris Mason (who worked on reiser3 at the time and is I think
> chief architect of btrfs now) cooked up a test script which
> could cause filesystem corruption in ext3 with about 50%
> probability in case of power loss.

Right, but how much data gets lost in such a case? In my
experience and that of every sysadmin I’ve asked, unrecoverable
corruption is essentially always quite localised with extX.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>