[Beowulf] MPI, fault handling, etc.
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Thu Mar 10 05:44:47 PST 2016
This is interesting stuff.
Think back a few years to when we were talking about checkpoint/restart issues: as the scale of your problem gets bigger, the time spent checkpointing becomes bigger than the time spent actually doing useful work.
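To put rough numbers on that, here is a back-of-envelope sketch. It assumes independent node failures (so system MTBF is node MTBF divided by node count), a fixed checkpoint cost, and Young's first-order approximation for the checkpoint interval. The function name and the sample figures (50,000 hour node MTBF, 15 minute checkpoint) are made up for illustration, not measurements from any real machine.

```python
import math

def useful_fraction(nodes, node_mtbf_h, checkpoint_h):
    """First-order estimate of the fraction of wall time spent on useful work."""
    # Assumption: failures are independent, so system MTBF shrinks as 1/nodes.
    system_mtbf_h = node_mtbf_h / nodes
    # Young's first-order approximation for the checkpoint interval.
    interval_h = math.sqrt(2.0 * checkpoint_h * system_mtbf_h)
    # Waste = time spent writing checkpoints + expected rework after a failure.
    waste = checkpoint_h / interval_h + interval_h / (2.0 * system_mtbf_h)
    return max(0.0, 1.0 - waste)

if __name__ == "__main__":
    for nodes in (100, 1_000, 10_000, 100_000):
        frac = useful_fraction(nodes, node_mtbf_h=50_000.0, checkpoint_h=0.25)
        print(f"{nodes:>7} nodes: useful fraction ~ {frac:.3f}")
```

Note that this keeps the checkpoint cost constant. In practice the checkpoint itself tends to get slower as the job grows, which makes the curve fall off even faster than this simple model suggests.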
And, of course, the reason we do checkpoint/restart is because it’s bare-metal and easy. Just like simple message passing is “close to the metal” and “straightforward”.
Similarly, there’s “fine grained” error detection and correction: ECC in memory; redundant comm links or retries. Each of them imposes some speed/performance penalty (it takes some non-zero time to compute the syndrome bits in an ECC, and some non-zero time to fix the errored bits… in a lot of systems these days, that might be buried in a pipeline, but the delay is there, and affects performance).
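For anyone who hasn't looked at what "compute the syndrome" means, here is a toy sketch using a Hamming(7,4) code. This is only an illustration: real memory ECC uses wider codes done in hardware, and the function names here are my own. The point is that the syndrome is computed on every read, whether or not there is an error.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (corrected codeword, syndrome); syndrome 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of a single flipped bit.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # the "fix the errored bit" step
    return c, syndrome

if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    damaged = list(word)
    damaged[4] ^= 1  # flip the bit at position 5
    fixed, syndrome = hamming74_correct(damaged)
    print(word, damaged, fixed, syndrome)
```

This toy code only handles single-bit errors; a double-bit flip would be "corrected" to the wrong word.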
I think of ECC as a sort of diffuse fault management: it’s pervasive, uniform, and the performance penalty is applied evenly through the system. Redundant (in the TMR sense) links are the same way.
Retries are a bit different. Detecting a fault is diffuse and pervasive (e.g. CRC checks occur on each message), but the correction of the fault is discrete and consumes resources at that time. In a system with tight time coupling (a pipelined systolic array would be the sort of worst case), many nodes have to wait to fix the one that failed.
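A minimal sketch of that split, assuming a made-up lossy link that occasionally flips one bit: the CRC is computed and checked on every message (the diffuse cost), while the retry loop only spends extra time when a check fails (the discrete cost). `noisy_link`, `send_with_retry` and the error rate are hypothetical names and values for illustration; only `zlib.crc32` is a real library call.

```python
import random
import zlib

def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload (paid on every message)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(framed: bytes):
    """Return the payload if the CRC matches, else None."""
    payload, crc = framed[:-4], int.from_bytes(framed[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None

def noisy_link(framed: bytes, rng: random.Random, error_rate: float) -> bytes:
    """Hypothetical channel: flips one random bit with probability error_rate."""
    data = bytearray(framed)
    if rng.random() < error_rate:
        data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
    return bytes(data)

def send_with_retry(payload, rng, error_rate=0.3, max_tries=10):
    """Resend until the CRC checks out; returns (payload, attempts used)."""
    for attempt in range(1, max_tries + 1):
        received = check(noisy_link(frame(payload), rng, error_rate))
        if received is not None:
            return received, attempt
    raise RuntimeError(f"link failed after {max_tries} tries")

if __name__ == "__main__":
    rng = random.Random(2016)
    for _ in range(5):
        data, attempts = send_with_retry(b"halo exchange row 17", rng)
        print(attempts, data)
```

In a tightly coupled job, every node waiting on this message is stalled for the extra attempts, which is where the cost stops being local.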
A lot depends on the application: tighter time coupling is worse than embarrassingly parallel (which is what a lot of the “big data” stuff is: fundamentally EP, scatter the requests, run in parallel, gather the results).
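The EP end of that spectrum is easy to sketch, which is sort of the point. Below is a rough scatter/gather where a failed chunk is simply resubmitted and nothing else has to be redone. It uses a thread pool as a stand-in for a cluster, and `run_ep`, `flaky_sum` and the `(chunk, attempt)` worker signature are invented for this example. It says nothing about the tightly coupled case, where this approach doesn't apply.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_ep(chunks, work, workers=4, max_tries=3):
    """Scatter chunks, gather results, and resubmit only the chunks that failed."""
    results = {}
    todo = list(range(len(chunks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for attempt in range(1, max_tries + 1):
            futures = {pool.submit(work, chunks[i], attempt): i for i in todo}
            todo = []
            for fut in as_completed(futures):
                i = futures[fut]
                try:
                    results[i] = fut.result()
                except Exception:
                    todo.append(i)  # only this chunk is redone
            if not todo:
                break
    if todo:
        raise RuntimeError(f"chunks {sorted(todo)} failed {max_tries} times")
    return [results[i] for i in range(len(chunks))]

def flaky_sum(chunk, attempt):
    """Hypothetical worker: simulates a node failure on the first attempt
    for chunks whose sum is even, then succeeds."""
    if attempt == 1 and sum(chunk) % 2 == 0:
        raise RuntimeError("simulated node failure")
    return sum(chunk)

if __name__ == "__main__":
    print(run_ep([[1, 2], [1, 1], [3, 4]], flaky_sum))
```

The chunks that succeeded never notice the failure; the only cost is the rerun of the one that didn't.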
The challenge is doing stuff in between: You may have a flock with excess capacity (just as ECC memory might have 1.5N physical storage bits to be used to store N bits), but how do you automatically distribute the resources to be failure tolerant? The original post in the thread points out that MPI is not a particularly facile tool for doing this. But I’m not sure that there is a tool, and I’m not sure that MPI is the root of the lack of tools. I think it’s that moving away from close to the metal is a “hard problem” to do in a generic way. (The issues about 32 bit counts are valid, though.)
James Lux, P.E.
Task Manager, DHFR Space Testbed
Jet Propulsion Laboratory
4800 Oak Grove Drive, MS 161-213
Pasadena CA 91109