[Beowulf] big read triggers migration and slow memory IO?

mathog mathog at caltech.edu
Thu Jul 9 09:40:41 PDT 2015


On 09-Jul-2015 06:48, Stuart Barkley wrote:
> Even though I doubt it is your problem, this smells similar to the
> zone_reclaim_mode issues we saw last year.
> 
> You might check 'sar -B' output.  Specifically the 'pgscand/s' column.

It stays at 0, but see the caveat below.
> 
> Check the setting of /proc/sys/vm/zone_reclaim_mode (it should be 0).

It is.
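
For anyone who wants to script the two checks above rather than eyeball
'sar -B', here is a minimal sketch, not something that was run in this
thread. It assumes a Linux /proc filesystem, and it assumes the
direct-reclaim scan counters in /proc/vmstat are named either
"pgscan_direct" or "pgscan_direct_<zone>" (the names vary by kernel
version). The helper names parse_vmstat, direct_scan_total and report
are hypothetical.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def direct_scan_total(counters):
    """Sum the direct-reclaim page-scan counters.

    Assumption: these are "pgscan_direct" or "pgscan_direct_<zone>".
    "pgscan_direct_throttle" counts throttling events, not pages scanned,
    so it is skipped.
    """
    total = 0
    for name, value in counters.items():
        if name == "pgscan_direct_throttle":
            continue
        if name == "pgscan_direct" or name.startswith("pgscan_direct_"):
            total += value
    return total


def read_text(path):
    with open(path) as f:
        return f.read()


def report(interval=1.0):
    """Print zone_reclaim_mode and the direct-scan rate over one interval."""
    mode = read_text("/proc/sys/vm/zone_reclaim_mode").strip()
    print("zone_reclaim_mode =", mode)
    before = direct_scan_total(parse_vmstat(read_text("/proc/vmstat")))
    time.sleep(interval)
    after = direct_scan_total(parse_vmstat(read_text("/proc/vmstat")))
    print("direct page scans per second ~", (after - before) / interval)
```

Calling report() while the slow read is in progress should give roughly
the same information as the 'pgscand/s' column, under the naming
assumption stated above.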

The caveat: this morning I cannot make the tests go slow!  Same
account, same command, same input file.  Apparently the issue depends on
how the system was used previously, and on an idle system it eventually
sorts itself out.  Before this problem was noticed, 40 of the 48 nodes
had each been used to generate and write one of these huge files
(17.45GB).  My testing of the read speed went on for about four hours
after that, and it was uniformly slow for test files over the
"just below 2^34 byte" limit for my account.  The system then sat idle
for about 15 hours, and now the performance issue isn't happening, not
even on a test file twice the size of the largest attempted yesterday.

Interestingly, "taskset" isn't needed now either.  When the test
program is run without it, it runs nicely and no "migration/#" process
ever pops up.
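
To repeat the pinned versus unpinned comparison from a script, a rough
sketch follows. It is not the test program used in this thread: it
only reads the file in chunks and discards the data, whereas the real
test evidently needed a lot of memory in one process. It uses
os.sched_setaffinity, which is Linux-only, as a stand-in for "taskset";
pin_to_cpu and timed_read are hypothetical helper names.

```python
import os
import time


def pin_to_cpu(cpu):
    """Restrict the calling process to one CPU and return the new affinity set."""
    os.sched_setaffinity(0, {cpu})  # pid 0 means "this process"
    return os.sched_getaffinity(0)


def timed_read(path, chunk_size=1 << 20):
    """Read a file sequentially; return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(chunk_size)
            if not block:  # empty read means end of file
                break
            total += len(block)
    return total, time.perf_counter() - start
```

Running timed_read on the same large file once without pinning and once
after pin_to_cpu gives two throughput figures to compare. The file will
probably be in the page cache on the second pass, so the comparison is
only meaningful if caching is accounted for.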

It seems that the earlier processing left the system in some state
that made the OS run short of some resource (who knows what),
triggering all of these issues when a lot of memory was needed on one
CPU (or in one process).

I will re-abuse the system and see if that reintroduces the problem.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

