[Beowulf] Problems with a JS21 - Ah, the networking...

Fri Sep 28 13:43:41 PDT 2007

Hello everybody,

I am beginning to take care of an IBM JS21 cluster. The cluster
consists of 112 nodes (8 BladeCenters), plus 3 Power5 management nodes
(1 headnode and 2 storage nodes) and a DS4200 storage array. We now
use GPFS as the cluster file system, on a dedicated gigabit service
network (using a Force10 S switch), and Myrinet 2000 for MPI. And now
comes the story...

The system was originally configured with NFS exported to the nodes
and GPFS between the two Power5 storage nodes (underneath them is the
DS4200 storage array, using RAID 5). NFS died badly, leaving lots of
badcalls and badclnt counts. Access and copy times were terrible, and
sometimes the connection just died. The GPFS daemon, mmfsd, on the
primary NSD server was stuck at 100% CPU. ssh did not show any
problems at the time, so it was some sort of problem with NFS or the
network. We then decided to replace NFS with GPFS across the entire
cluster, also restarting the mmfsd daemon, and that worked: all the
nodes had their file systems accessible again.

But sometimes, completely at random, a node is removed from the GPFS
cluster; the error message is about an "expired lease". This is still
happening. The failures occur about every 5 days on average, with or
without load, and on random nodes in the cluster. The node rejoins
GPFS after a few seconds. I wrote a script that checks whether a node
is disconnected from GPFS and, if so, pings the disconnected node. The
node still had network connectivity when GPFS failed.
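
A check of this kind can be sketched roughly as below. This is an
illustration, not the exact script: it assumes that "mmgetstate -a"
prints one row per node in the form "<number> <name> <state>", and the
function names are made up for the example.

```python
import subprocess


def down_nodes(mmgetstate_output):
    """Return the names of nodes whose GPFS state is not 'active'.

    Assumption: data rows look like '<number> <name> <state>'; header,
    separator and blank lines do not start with a node number.
    """
    nodes = []
    for line in mmgetstate_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # not a data row
        name, state = fields[1], fields[2]
        if state != "active":
            nodes.append(name)
    return nodes


def is_pingable(host):
    """Send one ICMP echo request and report whether a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_cluster():
    """Print, for every node GPFS reports as not active, whether it pings."""
    output = subprocess.run(
        ["mmgetstate", "-a"], capture_output=True, text=True
    ).stdout
    for node in down_nodes(output):
        status = "reachable" if is_pingable(node) else "unreachable"
        print(node, status)
```

Calling check_cluster() periodically from a node in the GPFS cluster
prints each node that GPFS considers down together with its ping
status.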

I sniffed the network on the storage nodes' interfaces, and I got lots
of TCP lost fragment, previous lost fragment, ACK lost fragment and
TCP window full messages. GPFS is now heavily used.
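
One way to put a number on this without a packet capture is to compare
the kernel's TCP counters over an interval. The sketch below parses
the "Tcp:" section of /proc/net/snmp (OutSegs and RetransSegs are
standard counters there); the function names are made up for the
example, and what counts as a worrying ratio is left open.

```python
def tcp_counters(snmp_text):
    """Parse the 'Tcp:' header and value lines of /proc/net/snmp into a dict."""
    tcp_lines = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp section found")
    names, values = tcp_lines[0], tcp_lines[1]
    return dict(zip(names, (int(v) for v in values)))


def retransmit_ratio(before, after):
    """Fraction of segments sent between two snapshots that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0


def read_tcp_counters(path="/proc/net/snmp"):
    """Take one snapshot of the TCP counters on the local machine."""
    with open(path) as handle:
        return tcp_counters(handle.read())
```

Taking one snapshot with read_tcp_counters(), waiting a while under
GPFS load, taking a second one and passing both to retransmit_ratio()
gives the retransmission rate for that interval on that node.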

In the meantime...

The Myrinet connection was working fine, but sometimes a user program
just got stuck: one of the processes was sleeping while all the others
were running, and the program hung. Investigating further, I found
that this also happens with the simple mpich examples such as cpi and
cpilog. We are using the mx driver version 1.1.6 and mpich-mx
1.2.7..5. mx_info shows all nodes connected when this happens, and the
switch did not overheat. mpirun.ch_mx -v shows that all the processes
are started correctly on the nodes, but somehow one (or more) of them
goes to sleep or never starts, and all the other processes just hang.
The mx diagnostic tools have not shown any problem so far, but we have
not yet run mx_pingpong, for example, because we still have users on
the cluster. There is no error whatsoever in the Myrinet logs or the
system logs.

The operating system is SUSE Enterprise 9, kernel version 2.6.5-7.244.
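
To see which rank is the sleeping one when a job hangs, the process
states can be read from /proc on each node. The sketch below is only
an illustration with made-up function names; it relies on the
documented layout of /proc/<pid>/stat ("pid (command) state ..."),
where the command name is truncated by the kernel to 15 characters.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, command, state


def states_by_command(command, proc_root="/proc"):
    """List (pid, state) for every local process with the given command name."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, name, state = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if name == command:
            found.append((pid, state))
    return found
```

For example, states_by_command("cpi") on a compute node returns the
PIDs of the local cpi processes with their state letters (R for
running, S for sleeping, D for uninterruptible sleep).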

We have other problems (like some BA060021 errors in the BladeCenter
logs, and PIO drv_stats x51 messages filling dmesg on the headnode),
but these connection issues are the main problems now.

Any suggestions? I can provide any logs necessary.

Thank you!
-- 
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------