[Beowulf] DMA Memory Mapping Question
Chris Samuel
csamuel at vpac.org
Wed Feb 21 16:45:53 PST 2007
Hi folks,
We've got an IBM Power5 cluster running SLES9 and using the GM drivers.
We occasionally get users who manage to use up all the DMA memory that is
addressable by the Myrinet card through the Power5 hypervisor.
Through various firmware and driver tweaks (thanks to both IBM and Myrinet)
we've gotten that limit up to almost 1GB and then we use an undocumented
environment variable (GMPI_MAX_LOCKED_MBYTE) to say only use 248MB of that
per process (as we've got 4 cores in each box), which we enforce through
Torque.
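For reference, the arithmetic behind that per-process cap is sketched below. The 4 cores and the 248MB figure are the ones quoted above; the function name is just illustrative.

```python
CORES_PER_NODE = 4      # cores per box, as stated above
PER_PROCESS_MB = 248    # value we give GMPI_MAX_LOCKED_MBYTE

def total_locked_mb(processes=CORES_PER_NODE, per_process_mb=PER_PROCESS_MB):
    """Worst-case locked DMA memory if every process hits its cap."""
    return processes * per_process_mb

print(total_locked_mb())   # 992, i.e. just under 1GB with all 4 cores busy
print(total_locked_mb(3))  # 744, the worst case for 3 processes
```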
The problems went away. Or at least they did until just now. :-(
The characteristic error we get is:
[13]: alloc_failed, not enough memory (Fatal Error)
Context: <(gmpi_init) gmpi_dma_alloc: dma_recv buffers>
Now Myrinet can handle running out of DMA memory once a process is running,
but when it starts it must be able to allocate a (fairly trivial) amount of
DMA memory otherwise you get that fatal error.
Looking at the node I can confirm that there are only 3 user processes
running, so what I am after is a way of determining how much of that DMA
memory a process has allocated.
I looked at /proc/${PID}/maps and saw this:
40028000-40029000 r--s 00002000 00:0c 8483 /dev/gm0
which looks like a memory mapping to me, but it only spans 0x1000 (4,096)
bytes..
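As a rough sketch of how one could total up such mappings for a process, something like the following sums the address ranges in /proc/${PID}/maps that are backed by /dev/gm0. The function names are made up for illustration, and it rests on an assumption I can't confirm: that the DMA memory registered through GM shows up as /dev/gm0 mappings at all. It may well be pinned by the driver without appearing here.

```python
def mapped_bytes(maps_text, path="/dev/gm0"):
    """Sum the sizes of all mappings in /proc/<pid>/maps text backed by `path`."""
    total = 0
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, dev, inode, pathname
        fields = line.split(None, 5)
        if len(fields) < 6 or fields[5].strip() != path:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        total += end - start
    return total

def mapped_bytes_for_pid(pid, path="/dev/gm0"):
    """Convenience wrapper that reads the live maps file for a PID."""
    with open(f"/proc/{pid}/maps") as f:
        return mapped_bytes(f.read(), path)

sample = "40028000-40029000 r--s 00002000 00:0c 8483 /dev/gm0\n"
print(mapped_bytes(sample))  # 4096
```

Run against each of the 3 user processes, mapped_bytes_for_pid() would at least show whether any of them holds a large /dev/gm0 mapping.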
Does anyone have any ideas at all ?
Oh - switching to the Myrinet MX drivers (which don't have this problem) is
not an option, we have an awful lot of users, mostly (non-computer)
scientists, who have their own codes and trying to persuade them to recompile
would be very hard - which would be necessary as we've not been able to
convince MPICH-GM to build shared libraries on Linux on Power with the IBM
compilers. :-(
cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia