[Beowulf] big read triggers migration and slow memory IO?
mathog
mathog at caltech.edu
Wed Jul 8 14:26:37 PDT 2015
This big Dell (PowerEdge T620/03GCP, 48 CPUs, >500 GB RAM) keeps throwing
me curve balls.
On the RAID file system there are a bunch of files of about
17453170224 bytes each (they hold slightly different numbers of
fixed-length records). At one level these bytes move around very
quickly; this takes only 3 seconds:
dd if=KTEMP1 of=/dev/null bs=8192
(5.8 GB/s), which means the file must already be in the page cache.
Nothing else is going on on this system.
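(One way to double-check the cache theory, assuming root access: drop
the page cache and rerun the dd. The first pass should then run at RAID
speed and the second at cache speed.)

sync
echo 3 > /proc/sys/vm/drop_caches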
However, when a program that uses this code (where len_file is again
17453170224)
buffer = malloc(len_file);          /* buffer is a char *; ~16.3 GB */
(void) posix_fadvise(fileno(fin), 0, 0, POSIX_FADV_SEQUENTIAL);
(void) posix_madvise(buffer, len_file, POSIX_MADV_SEQUENTIAL);
rlen = fread(buffer, 1, len_file, fin);
is run, the fread() takes at least 30 seconds, sometimes longer, to
complete. The thing is, "top" shows this:
  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
22501 mathog  20   0 16.3g  13g  520 R 71.3  2.6  0:44.86  binorder
   99 root    RT   0     0    0    0 S 16.6  0.0  0:08.75  migration/24
    3 root    RT   0     0    0    0 S 12.3  0.0  0:24.91  migration/0
What happens is that RES quickly jumps up to about half of VIRT, and
then the two migration processes start up, at which point the read
crawls. The numbers after "migration" vary from run to run. dd doesn't
run long enough to trigger whatever this migration business is. If my
test program is run a couple of times in a row it sometimes completes
the read in about 8 seconds, and when that happens the migration
processes do not appear.
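In case anyone wants to try to reproduce this elsewhere, here is a
stripped-down reader -- not binorder itself, just the fragment above
wrapped in main() with ordinary error checks -- which should show the
same pattern when run under time(1) on one of these ~16 GiB files:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>      /* posix_fadvise */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>   /* posix_madvise */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return EXIT_FAILURE;
    }
    FILE *fin = fopen(argv[1], "rb");
    if (fin == NULL) { perror("fopen"); return EXIT_FAILURE; }

    /* size the file, then pull all of it in with one fread() */
    fseeko(fin, 0, SEEK_END);
    off_t len_file = ftello(fin);
    rewind(fin);

    char *buffer = malloc((size_t)len_file);
    if (buffer == NULL) { perror("malloc"); return EXIT_FAILURE; }

    (void) posix_fadvise(fileno(fin), 0, 0, POSIX_FADV_SEQUENTIAL);
    (void) posix_madvise(buffer, (size_t)len_file, POSIX_MADV_SEQUENTIAL);
    size_t rlen = fread(buffer, 1, (size_t)len_file, fin);

    printf("read %zu of %lld bytes\n", rlen, (long long)len_file);
    free(buffer);
    fclose(fin);
    return EXIT_SUCCESS;
}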
Through all of this, iostat and iotop show no IO at all, presumably
because everything is moving between memory and the file cache, with
none of the read coming straight from the RAID.
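If what is happening is cross-node (NUMA) traffic, I would guess that
numastat (from the numactl package -- I have not dug into it yet) would
show numa_miss/numa_foreign climbing on one node while the read runs,
for instance with:

watch -n 1 numastat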
Anyway, using 30 s as a nice round number, that works out to about
582 MB/s (17453170224 bytes / 30 s) to move this data from one section
of memory to another. Which is pretty poor, since the stream benchmark
shows:
Function      Best Rate MB/s   Avg time   Min time   Max time
Copy:              5737.4      0.027951   0.027887   0.028254
Scale:             6273.8      0.025557   0.025503   0.025686
Add:               7632.6      0.031513   0.031444   0.031657
Triad:             8948.2      0.026896   0.026821   0.027126
all of which are about 10x faster. Note that the dd rate is consistent
with stream's Copy benchmark.
Can anybody shed some light on this behavior? In particular, why does
the OS feel the need to "migrate" something while one of these huge
reads is running? Mostly I want to know how to make it behave: leave
the process and its memory attached to one CPU (not a particular CPU,
just wherever it happens to land) and stop shuffling the data through
what seems to be a 1/10x-speed memory pathway. Also, is there really a
1/10x-speed memory pathway on this big box, or is it just that the
migration, whatever it is doing, has a lot of overhead?
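If pinning turns out to be the answer, I assume something like numactl
would do it, either from the command line (node 0 below is just an
example; any single node should do):

numactl --cpunodebind=0 --membind=0 ./binorder ...

or from inside the program with libnuma (link with -lnuma); a minimal
sketch:

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return EXIT_FAILURE;
    }
    numa_run_on_node(0);    /* restrict this process's CPUs to node 0   */
    numa_set_preferred(0);  /* prefer node 0 for subsequent allocations */
    /* ... then open the file, malloc(), and fread() as before ... */
    return EXIT_SUCCESS;
}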
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech