[Beowulf] Strange error, gluster/ext4/zone_reclaim_mode

Thu Aug 30 22:11:50 PDT 2012

Hi! And thanks for the answer, much appreciated!

On 08/31/2012 12:47 AM, Mark Hahn wrote:
>> However, at one point one of the machines serving the file system went
>> down, after spitting out error messages as indicated in
>> https://bugzilla.redhat.com/show_bug.cgi?id=770545
>>
>> We used the advice indicated in that link ("sysctl -w
>> vm.zone_reclaim_mode=1"), and after that the file servers seem to run
>> OK.
>
> this seems to be quite dependent on hardware, architecture, workload,
> kernel.  did you notice any performance problems?  or try the 
> vm.min_free_kbytes angle?  (also, does this server have swap?)
We use the standard CentOS 6.2 kernel (2.6.32-220.17.1), and I didn't notice 
anything strange on the servers (apart from the error messages before 
changing zone_reclaim_mode). As for min_free_kbytes, it is 225280 on the 
computational nodes (132 G total memory) and 90112 on the file servers (32 G 
total memory). Is this something I should try changing? The servers have 
swap (64 G).
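
For comparing these settings across nodes, here is a minimal sketch that reads 
the two values straight from /proc/sys/vm on a Linux machine. The function 
name read_vm_settings and its base parameter are my own, purely for 
illustration, and a value is reported as None if the file is missing (for 
example on a kernel without that setting).

```python
from pathlib import Path


def read_vm_settings(names, base="/proc/sys/vm"):
    """Return {name: int or None} for the given vm sysctl names.

    `base` is a parameter only so the function can be pointed at another
    directory; on a real Linux system the default is the right place.
    """
    settings = {}
    for name in names:
        try:
            text = (Path(base) / name).read_text()
            settings[name] = int(text.split()[0])
        except (OSError, ValueError, IndexError):
            # File not present, unreadable, or not a plain integer.
            settings[name] = None
    return settings


if __name__ == "__main__":
    wanted = ["zone_reclaim_mode", "min_free_kbytes"]
    for name, value in read_vm_settings(wanted).items():
        print(f"vm.{name} = {value}")
```

For scale, 225280 kB is 220 MiB and 90112 kB is 88 MiB.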

>
>> 1. We had to change the torque submit script like
>>
>> ssh $(hostname) "mpirun -machinefile bla bla bla"
>
> I think this is unrelated.  are you sure nothing changed, torque-wise,
> even its qmgr-level config?  (or mpi versions/config.)
I agree, it seems unrelated, but I can't find anything else that has 
changed!
>
>
>> 3. We have seen particularly lousy performance on one of our 
>> applications.
>
> does it do a lot of file IO?
>
No. When profiling it I noticed that one particular operation, 
computing the gradient of a vector field, took too long, and the time it 
takes also varies substantially between iterations. However, when I 
performed the operation a second time (an extra "dummy operation"), the 
repeat was NOT that slow. Could this indicate that it has something to do 
with how the memory is handled?
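
The first-call versus repeat comparison described above can be sketched as a 
small timing harness. This is only an illustration: time_repeats and 
touch_buffer are hypothetical names, and touch_buffer is a stand-in workload 
that writes one byte per 4096-byte stride, not the actual gradient 
computation. A slow first call followed by faster repeats would be a hint, 
not proof, that memory handling is involved.

```python
import time


def time_repeats(fn, repeats=3):
    """Call fn() `repeats` times back to back.

    Returns the wall-clock duration of each call in seconds, in call order,
    so the first call can be compared with the repeats.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def touch_buffer(n_bytes=16 * 1024 * 1024, stride=4096):
    """Hypothetical stand-in workload: allocate a buffer, write one byte per stride."""
    buf = bytearray(n_bytes)
    for i in range(0, n_bytes, stride):
        buf[i] = 1
    return buf


if __name__ == "__main__":
    for i, seconds in enumerate(time_repeats(touch_buffer)):
        print(f"call {i}: {seconds:.4f} s")
```

In the real application, the function passed to time_repeats would be the 
gradient step itself, called on the same data each time.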

Also, we previously used a very similar setup, but with all machines 
running CentOS-5, and then we didn't see this strange 
behaviour.

Thanks again!

/jon