[Beowulf] DMA Memory Mapping Question

Patrick Geoffray patrick at myri.com
Wed Feb 21 20:06:50 PST 2007


Hi Chris,

Chris Samuel wrote:
> We occasionally get users who manage to use up all the DMA memory that is 
> addressable by the Myrinet card through the Power5 hypervisor.

The IOMMU limit set by the hypervisor varies depending on the machine, 
the hypervisor version and the phase of the moon. Sometimes it's a 
limit per PCI slot (i.e. per device), sometimes it is a limit for the 
whole machine (which can be a virtual machine, that's one of the reasons 
behind the hypervisor) and it's shared by all the devices. Sometimes it's 
reasonably large (1 or 2 GB), sometimes it is ridiculously small (256 MB).

The hypervisor does not make a lot of sense in an HPC environment, but it 
would be non-trivial work to remove it on PPC.

> Through various firmware and driver tweaks (thanks to both IBM and Myrinet) 
> we've gotten that limit up to almost 1GB and then we use an undocumented 
> environment variable (GMPI_MAX_LOCKED_MBYTE) to say only use 248MB of that 
> per process (as we've got 4 cores in each box), which we enforce through 
> Torque.
> 
> The problems went away.  Or at least it did until just now. :-(
> 
> The characteristic error we get is:
> 
> [13]: alloc_failed, not enough memory (Fatal Error)
>         Context: <(gmpi_init) gmpi_dma_alloc: dma_recv buffers>
> 
> Now Myrinet can handle running out of DMA memory once a process is running, 
> but when it starts it must be able to allocate a (fairly trivial) amount of 
> DMA memory otherwise you get that fatal error.

GM does pipeline large messages in chunks of 1 MB, so you can make 
progress as long as you can register 1 MB at a time (you can think of 
pathological deadlocking situations, but that's not the common case). 
However, GM registers some buffers for Eager messages at init time. From 
memory, it's on the order of 32 MB per process (constant, it does not 
depend on the size of the job). If you can't register that, there is 
nothing you can do, so aborting is a good idea.
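
For what it's worth, the pipelining idea boils down to something like the 
sketch below. It's only an illustration, not the actual GM code; 
register_chunk(), send_chunk() and deregister_chunk() are hypothetical 
stand-ins for the real registration and DMA calls.

  #include <stddef.h>

  #define CHUNK_SIZE (1 << 20)   /* 1 MB, the GM pipeline granularity */

  /* Hypothetical placeholders for registering and DMA-ing one chunk. */
  int  register_chunk(void *addr, size_t len);
  void send_chunk(void *addr, size_t len);
  void deregister_chunk(void *addr, size_t len);

  /* Send a large message while never holding more than 1 MB of
   * registered (IOMMU-mapped) memory at a time. */
  int send_pipelined(char *buf, size_t total)
  {
      size_t off = 0;
      while (off < total) {
          size_t len = (total - off < CHUNK_SIZE) ? total - off : CHUNK_SIZE;
          if (register_chunk(buf + off, len) != 0)
              return -1;          /* can't even map 1 MB: give up */
          send_chunk(buf + off, len);
          deregister_chunk(buf + off, len);
          off += len;
      }
      return 0;
  }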

If you limit registration per process, then I can think of one situation 
that would hit the IOMMU limit: if a process dies an abnormal death 
(segfault, killed, whatever), the GM port will be "shutting down" while 
the outstanding messages are dropped. During this time, the memory is 
still registered. If you start another process at that point, you will 
effectively have more than 4 processes with registered memory, and that 
may exceed the limit. A quick workaround would be to modify the MPICH-GM 
init code to only try to open the first 4 GM ports. That would in effect 
guarantee that only 4 processes can register memory at one time (the 
latest release of GM provides 13 ports).
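
Roughly, the change in the MPICH-GM init code would look like the sketch 
below. This is only an illustration: I am quoting the gm_open() arguments 
from memory (check gm.h), and the exact range of usable port numbers 
differs between GM releases, so take the 4-port range as an example.

  #include <gm.h>

  /* Sketch: try only 4 GM ports instead of all of them, so that at most
   * 4 processes per node can hold registered memory at once. */
  static struct gm_port *open_limited_port(unsigned int unit)
  {
      struct gm_port *port = NULL;
      unsigned int p;

      for (p = 2; p <= 5; p++) {   /* example range of 4 user ports */
          if (gm_open(&port, unit, p, "mpich-gm",
                      GM_API_VERSION) == GM_SUCCESS)
              return port;         /* got one of the 4 allowed ports */
      }
      return NULL;                 /* all 4 ports busy: refuse to start */
  }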

I see from your next post that this is not what happened. It could have 
been :-)

> Looking at the node I can confirm that there are only 3 user processes 
> running, so what I am after is a way of determining how much of that DMA 
> memory a process has allocated.

There is no handy way, but it would not be hard to add this info to the 
output of gm_board_info. There are not many releases of GM these days. 
Nevertheless, I will add it to the queue; it's simple enough not to be 
considered a new feature.

> Oh - switching to the Myrinet MX drivers (which doesn't have this problem) is 
> not an option, we have an awful lot of users, mostly (non-computer)

Actually, MX would not behave well in your environment: MX does not 
pipeline large messages, it registers the whole message at once (MX 
registration is much faster, and pipelining prevents overlap of 
communication with computation). With 250 MB of DMA-able memory per 
process, that would be the maximum message size you could send or receive.
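
If users do need single messages larger than that, the application-level 
workaround is simply to split them into pieces below the limit, along the 
lines of this sketch (plain MPI, nothing MX-specific; the 64 MB chunk 
size is just an example, and the receiver would of course have to post 
matching chunked receives):

  #include <mpi.h>
  #include <stddef.h>

  #define CHUNK (64 * 1024 * 1024)   /* 64 MB, well under a ~250 MB limit */

  /* Sketch: send a large buffer in chunks small enough to register. */
  void send_in_chunks(char *buf, size_t total, int dest, MPI_Comm comm)
  {
      size_t off = 0;
      while (off < total) {
          int len = (int)(total - off < CHUNK ? total - off : CHUNK);
          MPI_Send(buf + off, len, MPI_CHAR, dest, 0, comm);
          off += (size_t)len;
      }
  }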

We have plans to do something about that, but it's not at the top of the 
queue. The right thing would be to get rid of the hypervisor (by the 
way, the hypervisor makes the memory registration overhead much more 
expensive), but that will probably never happen.

> scientists, who have their own codes and trying to persuade them to recompile 
> would be very hard - which would be necessary as we've not been able to 
> convince MPICH-GM to build shared libraries on Linux on Power with the IBM 
> compilers. :-(

Time for dreaming about an MPI ABI :-)

Patrick
-- 
Patrick Geoffray
Myricom, Inc.
http://www.myri.com


