[Beowulf] Strange error, gluster/ext4/zone_reclaim_mode
Jon Tegner
tegner at renget.se
Thu Aug 30 12:24:34 PDT 2012
Hi,
We have a strange error. We run CFD calculations on a small cluster.
Basically it consists of a bunch of machines connected to a file system.
The file system consists of 4 servers, CentOS-6.2, ext4 and glusterfs
(3.2.7) on top. Infiniband is used for interconnect.
For scheduling/resource management we use torque/maui, and typically we
submit jobs with a line like this in the torque submit script:
mpirun -machinefile bla bla bla
However, at one point one of the machines serving the file system went
down, after spitting out error messages as indicated in
https://bugzilla.redhat.com/show_bug.cgi?id=770545
We followed the advice given at that link ("sysctl -w
vm.zone_reclaim_mode=1"), and after that the file servers seem to run
OK. This happened in the middle of summer, and a few weeks later we
noticed a few strange things:
1. We had to change the torque submit script to something like
ssh $(hostname) "mpirun -machinefile bla bla bla"
2. zone_reclaim_mode was set to 1 on all computational nodes (on the
file servers this was done explicitly, but NOT on the computational nodes).
3. We have seen particularly lousy performance on one of our applications.
4. The output of "tail -f file" doesn't get updated properly.
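For reference on point 2, here is a minimal sketch of how the setting can be read on a node. The function name show_zone_reclaim_mode and its optional path argument are made up for illustration; the /proc entry is the standard location for this sysctl on NUMA-enabled kernels.

```shell
#!/bin/sh
# Print the current vm.zone_reclaim_mode value on this host.
# The optional argument (a file path) is only an illustration aid so the
# function can be tried against a test file; by default it reads the
# real /proc entry.
show_zone_reclaim_mode() {
    file="${1:-/proc/sys/vm/zone_reclaim_mode}"
    if [ -r "$file" ]; then
        printf 'zone_reclaim_mode=%s\n' "$(cat "$file")"
    else
        printf 'cannot read %s\n' "$file" >&2
        return 1
    fi
}

# Possible use inside a torque job, assuming passwordless ssh between
# nodes and that $PBS_NODEFILE lists the allocated hosts:
#   for n in $(sort -u "$PBS_NODEFILE"); do
#       printf '%s: ' "$n"; ssh "$n" cat /proc/sys/vm/zone_reclaim_mode
#   done
```

Running the commented loop from a job would show whether every computational node really reports 1, or only some of them.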
Any help/hints would be greatly appreciated!
Regards,
/jon