[Beowulf] failure rates

Walid walid.shaari at gmail.com
Mon Feb 5 09:48:33 PST 2007

I do not know if I can really help answer the original question, but most of the failures we see from the system side are, in that order:

hard disks
interconnect cards
misconfigured nodes
uncorrected memory errors
system board failures
unexplainable failures

Failures related to the application itself we do not see, as the users will resubmit their jobs and quietly correct their mistakes.

The thing is, clusters by definition are not highly available systems; they are made up of commodity hardware, and since most of these clusters use a standard MPI implementation, they work on the fail-stop principle: if something fails, the job stops. Most of the time, failure investigation is minimal, as the priority is getting the node back into service.
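The cost of fail-stop can be caricatured with a toy model (pure illustration with made-up numbers, not measurements of any real cluster; real MPI jobs abort via the default MPI_ERRORS_ARE_FATAL error handler):

```python
import random

# Toy fail-stop model: a "job" uses n nodes; if any node fails during
# the run, the whole job aborts and must be resubmitted from scratch.
# The per-node failure probability (0.001) is an illustrative guess.

def run_job(n_nodes: int, p_node_failure: float, rng: random.Random) -> bool:
    """Return True if the job completes, False if any node failed."""
    return all(rng.random() >= p_node_failure for _ in range(n_nodes))

def submissions_until_success(n_nodes: int, p_node_failure: float,
                              rng: random.Random) -> int:
    """Count how many submissions it takes before one run completes."""
    attempts = 1
    while not run_job(n_nodes, p_node_failure, rng):
        attempts += 1
    return attempts

rng = random.Random(42)
for n in (8, 128, 1024):
    avg = sum(submissions_until_success(n, 0.001, rng)
              for _ in range(200)) / 200
    print(f"{n:4d} nodes: ~{avg:.1f} submissions per completed job")
```

The point of the sketch is only that, under fail-stop, the resubmission burden the users quietly absorb grows with job size.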

So is failure rate really a concern? If it were, we would see more fault-tolerance layers in clusters, and failure-rate metrics in monitoring tools and reports. I am interested in reducing these failure rates because user demands are growing: instead of using a few nodes, users now use as many as possible and ask for even more, and the more you give them, the more failures we will see!
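That last point can be put on a back-of-the-envelope footing: if independent nodes each fail with probability p during a job, the chance that an n-node job sees at least one failure is 1 - (1 - p)^n, which climbs quickly with n (the per-node probability below is an illustrative number, not a measured rate):

```python
# P(at least one of n nodes fails) = 1 - (1 - p)^n, assuming
# independent node failures. p = 0.001 is illustrative, not data.

def job_failure_probability(n_nodes: int, p_node: float = 0.001) -> float:
    """Probability that at least one of n_nodes fails during the job."""
    return 1.0 - (1.0 - p_node) ** n_nodes

for n in (4, 64, 512, 4096):
    print(f"{n:5d} nodes -> {job_failure_probability(n):.1%} "
          f"chance of at least one failure")
```

Even a per-node failure rate that looks negligible on 4 nodes becomes near-certain failure somewhere once jobs span thousands of nodes.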

What will you be trying to achieve with your thesis? Will the question of how to reduce or manage these failures be part of it?


