[Beowulf] Anyone know the LinkAggregation(Trunking) on a switch?
Robert G. Brown
rgb at phy.duke.edu
Tue Sep 6 10:07:26 PDT 2005
Michael Will writes:
> What is the exact definition of a failed machine?
> - turned off
> - kernel panicked, halted but still on
> - above a certain load limit?
One with its head down and a dejected look on its face.
Sometimes you find them sleeping in the lower slots of a decrepit
old rack or cadging packets off of the network that are intended for
newer, faster machines. Many of them are chronic smokers and just plain
stink up the neighborhood.
These machines need our sympathy. I run a home for them, and if you
have any of them in your racks you should feel free to send them on to
me. A few months of computing something like mandelbrot set problems or
something embarrassingly parallel is often enough to permit me to return
them, revitalized, into a dynamic and successful hpc environment.
(Sorry, but since Seth left us -- and left me doing at least part of his
job until his replacement arrives -- I'm feeling REALLY RANDOM.)
> Do you use heartbeat to detect those and STONITH in order to turn off a
> failed machine?
OK, now I'll try to be serious. There is no exact definition.
A functional definition is any machine that is supposed to be working on
a given computation but for any reason at all is not. Then you can
start playing semantic games with how BADLY it has failed and WHY it has
failed, and not worry about the definitions so much.
So all of the above are failed machines in a way. Some of them are
reversible conditions or easily cured conditions. Others may be more
severe. Of the severe ones, some are hardware problems (the damn thing
is physically broken) and others are software (the damn thing is
So a more apropos question is "for this particular kind of situation,
what are you counting as a failure", or something like that, which has a
more limited and specific answer, perhaps.
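To make the functional definition concrete for the trunking case quoted below, here is a minimal sketch in Python, not anything the switch or IPVS actually does. It treats a node as failed when its link is up but the service it is supposed to run does not answer. The names service_alive and node_state, the port argument, and the one-second timeout are all hypothetical, chosen only for illustration.

```python
import socket


def service_alive(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def node_state(link_up: bool, host: str, port: int, timeout: float = 1.0) -> str:
    """Classify a node using both link state and a service-level probe.

    link_up is whatever the switch LED (or its management interface) reports;
    how that value is obtained is outside this sketch.
    """
    if not link_up:
        return "link down"
    if service_alive(host, port, timeout):
        return "working"
    # The case the switch cannot see: link LED is on, machine is not working.
    return "failed (link up, service not answering)"
```

Note that a successful TCP connect only shows that the kernel is still accepting connections on that port. A stricter probe would send a real request and check the reply, which is closer to "is it working on the computation it is supposed to be working on".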
But I like my first answer better:-)
> Zhang Hui wrote:
>> I have got a problem with the trunking failover on a switch.
>> I have implemented multi-machine trunking (one link per machine). Packets are distributed between the links/machines and finally forwarded to a real server via IPVS (by Zhang Wensong), like this:
>>| | |------| ________
>>|.|__/\___| 1 |_____| |
>>| | || |______| | real |
>>| | ||trunk | |
>>| | || |------| |server|
>>|.|__||___| 2 |_____| |
>>|_| \/ |______| |______|
>> When one link (to one machine) goes down, the connection is transferred to another link/machine and finally to the "server", with the session kept.
>> The problem is when, for example, "1" is down but the link to "1" is still up (judging from the LED for "1" on the switch). The "switch" then does not think "1" is down and distributes packets to "1" as usual, so the connection is lost.
>> Can't the switch sense the death of a machine intrinsically? Or is something wrong with the trunking configuration on the "switch"?
>> By the way, the "switch" is a 3Com SS3300TM 16986A.
>> Can anyone help me? I would greatly appreciate any reply.
>> Zhang Hui
>> spacetiller at 163.com
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> Michael Will
> Penguin Computing Corp.
> Sales Engineer
> 415-954-2899 fx
> mwill at penguincomputing.com