[Beowulf] Strange error, gluster/ext4/zone_reclaim_mode

Thu Aug 30 12:24:34 PDT 2012

Hi,

We have a strange error. We run CFD calculations on a small cluster.
Basically it consists of a bunch of machines connected to a file system.
The file system consists of 4 servers running CentOS-6.2, with ext4 and
glusterfs (3.2.7) on top. Infiniband is used for the interconnect.

For scheduling/resource management we use torque/maui, and typically we 
submit jobs with a torque submit script like:

mpirun -machinefile bla bla bla

However, at one point one of the machines serving the file system went 
down, after spitting out error messages as indicated in

https://bugzilla.redhat.com/show_bug.cgi?id=770545

We followed the advice in that link ("sysctl -w 
vm.zone_reclaim_mode=1"), and after that the file servers seem to run 
OK. This happened in the middle of summer, and a few weeks later we 
noticed a few strange things:

1. We had to change the torque submit script to:

ssh $(hostname) "mpirun -machinefile bla bla bla"

2. zone_reclaim_mode was set to 1 on all computational nodes (on the 
file servers this was done explicitly, NOT so on the computational nodes).

3. We have seen particularly lousy performance on one of our applications.

4. The output of "tail -f file" doesn't get updated properly.
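
For item 2, the setting can be read back on a node from 
/proc/sys/vm/zone_reclaim_mode (the same value that "sysctl 
vm.zone_reclaim_mode" reports). Below is a minimal sketch of such a 
check; the function name and its optional path argument are made up for 
illustration and are not part of any tool.

```shell
#!/bin/sh
# Print the current zone_reclaim_mode value of this node.
# read_zone_reclaim_mode is a hypothetical helper name. The optional
# first argument overrides the file to read and defaults to the
# standard procfs location of the sysctl.
read_zone_reclaim_mode() {
    path="${1:-/proc/sys/vm/zone_reclaim_mode}"
    if [ -r "$path" ]; then
        # The file holds a single integer (0 means zone reclaim is off).
        cat "$path"
    else
        echo "cannot read $path" >&2
        return 1
    fi
}
```

Calling read_zone_reclaim_mode with no argument on each computational 
node (for example through ssh) would show whether the value really is 1 
everywhere.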

Any help/hints would be greatly appreciated!

Regards,

/jon