[Beowulf] big read triggers migration and slow memory IO?

Wed Jul 8 15:43:27 PDT 2015

On 8 July 2015 at 22:26, mathog <mathog at caltech.edu> wrote:

> This big Dell (PowerEdge T620/03GCP, 48 CPUs, >500GB RAM) keeps throwing
> me curve balls.
>
> On the RAID file system there are a bunch of files having about
> 17453170224 bytes.  (Slightly different numbers of fixed length records.)
> At one level these bytes move around very quickly, this takes only 3
> seconds:
>
> dd if=KTEMP1 of=/dev/null bs=8192
>
> (5.8GB/s) which means it must already be in cache.  Nothing else is going
> on on this system. However, when a program that uses this code (where
> len_file is again 17453170224)
>
>    buffer=malloc(len_file);
>   (void) posix_fadvise(fileno(fin), 0, 0, POSIX_FADV_SEQUENTIAL);
>   (void) posix_madvise(buffer, len_file, POSIX_MADV_SEQUENTIAL);
>    rlen = fread(buffer, 1, len_file, fin);
>
> is run the fread() takes at least 30 seconds, sometimes longer, for the
> read to complete.  The thing is, "top" shows this (sorry about the wrap):
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 22501 mathog    20   0 16.3g  13g  520 R 71.3  2.6   0:44.86 binorder
>    99 root      RT   0     0    0    0 S 16.6  0.0   0:08.75 migration/24
>     3 root      RT   0     0    0    0 S 12.3  0.0   0:24.91 migration/0
>
>
I think your process is being moved between NUMA nodes and you're losing
locality to the data. Try confining the process and its data to the same
NUMA node with the numactl command.
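
If you would rather do this from inside the program than wrap it in
numactl, here is a minimal sketch, assuming Linux with glibc. It pins the
process to whichever CPU it happens to be running on, so the scheduler's
migration threads have nowhere to move it. The helper name
pin_to_current_cpu is made up for illustration and is not from your code.
Note that this is narrower than binding to a whole NUMA node: it restricts
the process to a single CPU.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper (not from the original program): pin the calling
 * process to the CPU it is currently running on.
 * Returns the CPU number on success, -1 on failure. */
int pin_to_current_cpu(void)
{
    int cpu = sched_getcpu();        /* CPU we are on right now */
    if (cpu < 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);              /* allow only that one CPU */

    /* pid 0 means "the calling thread/process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;

    return cpu;
}
```

It would need to be called before the malloc() and fread(), so that the
buffer pages are first touched while the process is already pinned. With
the default local allocation policy the kernel will then try to place them
on that CPU's node. It does nothing about where the page cache copy of the
file lives, so it may not remove all of the cross-node traffic.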

> What happens is that RES quickly jumps up to about half of VIRT and then
> the two migration processes start up, at which point it crawls.
> The numbers after "migration" vary.  dd doesn't run long enough to
> trigger whatever this migration business is.  If my test program
> is run a couple of times in a row sometimes it completes the read
> in about 8 seconds.  When that happens the migration processes will not
> appear.
>
> Through all of this iostat and iotop do not show any IO at all, presumably
> because it is all going between memory and file cache, with none of the
> read being straight from the RAID.
>
> Anyway, using 30s as a nice round number that works out to about 582MB/s
> to move this data from one section of memory to another.  Which is pretty
> poor since the stream benchmark shows:
>
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:            5737.4     0.027951     0.027887     0.028254
> Scale:           6273.8     0.025557     0.025503     0.025686
> Add:             7632.6     0.031513     0.031444     0.031657
> Triad:           8948.2     0.026896     0.026821     0.027126
>
> all of which are 10x faster.  Note that the dd time is consistent
> with stream's copy benchmark.
>
> Can anybody shed some light on this behavior?  In particular, why does the
> OS feel the need to "migrate" something when one of these huge reads is
> running?  Mostly I want to know how to make it behave, leaving the
> process/memory attached to one CPU (but not a particular CPU, just wherever
> it happens to put it) and not shuffle the data through what seems to be a
> 1/10X speed memory pathway.  Also, is there really a 1/10X memory speed
> pathway on this big box, or is it just that the migration, whatever that is
> doing, has a lot of overhead?
>

Assuming your machine is NUMA (hwloc-ls will show you this): in my
experience, some of the E5s have really poor performance for inter-NUMA
communication.
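
If hwloc isn't installed, a rough check can also be done from C. This is
only a sketch, assuming a Linux kernel that exposes the node list in
/sys/devices/system/node/online (a list such as "0-1"). The function names
count_ids and online_numa_nodes are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Count the IDs in a kernel-style list such as "0-3" or "0,2-3".
 * Returns -1 if the list is malformed. */
int count_ids(const char *list)
{
    int count = 0;
    const char *p = list;

    while (*p && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p)
            return -1;               /* no digits where a number was expected */
        long hi = lo;
        if (*end == '-') {           /* a range "lo-hi" */
            p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (hi < lo)
            return -1;
        count += (int)(hi - lo + 1);
        p = end;
        if (*p == ',')
            p++;                     /* move on to the next entry */
    }
    return count;
}

/* Number of online NUMA nodes, or -1 if the sysfs file cannot be read. */
int online_numa_nodes(void)
{
    char buf[256];
    FILE *f = fopen("/sys/devices/system/node/online", "r");
    if (!f)
        return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok ? count_ids(buf) : -1;
}
```

A result greater than 1 from online_numa_nodes() means the kernel sees
more than one memory node, i.e. the box is NUMA.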

Good luck!

> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

-- 
Jonathan Barber <jonathan.barber at gmail.com>