To crash or not to crash
Eray Ozkural
erayo at cs.bilkent.edu.tr
Thu May 9 20:12:57 PDT 2002
On Friday 10 May 2002 00:58, W Bauske wrote:
> Eray Ozkural wrote:
> > It's very easy to crash a node with a suitable code, so I shouldn't have
> > to re-install it or manually fsck it every time it fails to reboot after
> > such a crash...
>
> How do you "easily" crash a node. Are you exceeding some resource
> limit or??
>
> I run quite large problems and don't see problems. Perhaps you mean
> performance grinds to a halt because of paging or something like that
> which makes the node un-responsive so you power cycle it.
>
Well. :) I think it depends on the application, but it's a sure thing that I
can't provide you with some minimal code that's going to freeze any system
for good. It does happen from time to time, though, more so on certain kernel
version / hardware combinations. It's hard to say when and how those things
happen but exhausting system resources is a good way to disrupt normal
operation as you say. But by crash I mean crash, not temporary inflation of
the working set.
Let me try to give an example of what happens. I sometimes run a large
program, i.e. one that uses lots of CPU/disk/network, and a node simply goes
down. I'm sure almost everybody has seen that kind of thing; for instance, some
GL programs used to crash XFree86 and the whole system rather easily. The
system would lock or reboot right away... If you write algorithms that use a
lot of system resources or do unusual things, you may have done it with your
own user-space code, too.
I have never used a system that cannot be crashed :) If you've used such a
system, feel free to advertise it, but Linux is certainly not like that :)
(Maybe the *BSD people would want to praise their systems right now :) )
After all, these kinds of things are to be expected because *nobody* can give
a formal proof that the system cannot crash, if you know what I mean.
Unless, of course, the whole system was built upon such an invariant, which is
not the case.
I'm hoping that this gives a little justification for why you would want a
filesystem that will not lose precious files/dirs on an unexpected crash;
well, all crashes are unexpected...
Now if the computer that crashes is your home PC, and you are the only user,
it may be possible to predict what might crash your system, like when you're
testing your uber-kernel-module or superb-ai-algorithm. The problem is that even
then you can't guarantee that it won't crash. My claim here is that you can
crash a system with appropriate user-space code.
On a cluster the probability that one of the nodes might crash is high.
Of course, I would like to have a system that is wholly immune to crashes,
but I think it is a little naive to claim that Linux cannot crash. The uptime
of some Linux boxen does not show that Linux is incapable of crashing; it's
simply that the whole system there was in a stable region. Try changing the
system components frequently, and you will get a crash. [*]
Now I won't ever say "crash" again. :) Some people here might want to hear me
say that "Linux cannot crash, and ext2 is the best filesystem ever written"
but I won't say it even if Linus Torvalds and gang join this thread :) I
doubt they would make such an over-confident statement :)
And I still think that ext3 is not the only filesystem that is better than
ext2.
You could surely say that Linux is more stable than, say, any version of
Windows, which I would wholeheartedly agree with.
Cheers,
[*] Or maybe it might be said that I haven't configured my systems well
enough, true, but what's the point of an OS if I have to configure it to
prevent it from crashing? :)
--
Eray Ozkural (exa) <erayo at cs.bilkent.edu.tr>
Comp. Sci. Dept., Bilkent University, Ankara
www: http://www.cs.bilkent.edu.tr/~erayo Malfunction: http://mp3.com/ariza
GPG public key fingerprint: 360C 852F 88B0 A745 F31B EA0F 7C07 AE16 874D 539C