[Beowulf] New member, upgrading our existing Beowulf cluster
Håkon Bugge h-bugge at online.no
Thu Dec 3 23:29:29 PST 2009
- Previous message: [Beowulf] New member, upgrading our existing Beowulf cluster
- Next message: [Beowulf] New member, upgrading our existing Beowulf cluster
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi,

On Dec 4, 2009, at 3:34, Chris Samuel wrote:

> How does it deal with pinned DMA memory on NICs?

What we did in Platform (Scali) MPI was to drain the HPC interconnect and then close it down. The problem was then reduced to checkpointing N processes (e.g. using BLCR). Both continuing from the checkpoint and restarting from it re-open the HPC fabric, which could be on another physical medium. You could take the checkpoint on IB and restart using GbE.

Combined with interconnect-agnostic support, this feature allows you, in the case of a failing IB HCA (or a failing switch port or cable), to restart from the last checkpoint and run with M-1 nodes each communicating with the other M-2 IB-capable nodes using IB, and the last node communicating with the M-1 nodes using GbE.

Traditional checkpointing requires a snapshot of the file system in the general case (and restoring the correct snapshot at restart), whereas checkpoint-and-kill (for migration or preemptive batch scheduling) does not require integration with file systems.

Håkon
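To make the mixed IB/GbE restart above concrete, here is a minimal Python sketch of per-pair transport selection. It is only an illustration and not Platform (Scali) MPI code: the function name `pick_transports`, the `ib_capable` set, and the "ib"/"gbe" labels are all hypothetical, and the rule assumed is simply that a pair uses IB only when both ends still have a working IB path.

```python
from itertools import combinations


def pick_transports(num_nodes, ib_capable):
    """Return {(a, b): transport} for every unordered pair of nodes.

    Hypothetical helper: a pair uses "ib" only when both nodes are in
    ib_capable; any pair involving a node without IB falls back to "gbe".
    """
    transports = {}
    for a, b in combinations(range(num_nodes), 2):
        both_ib = a in ib_capable and b in ib_capable
        transports[(a, b)] = "ib" if both_ib else "gbe"
    return transports


# M = 4 nodes; assume node 3 lost its HCA, so only nodes 0..2 can use IB.
layout = pick_transports(4, ib_capable={0, 1, 2})
for pair, transport in sorted(layout.items()):
    print(pair, transport)
```

With M = 4 and one failed HCA, each of the three remaining IB-capable nodes talks to the other two over IB, and the fourth node talks to all three over GbE, which is the M-1 / M-2 arrangement described in the message.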
