[Beowulf] Slow RAID reads, no errors logged, why?
David Mathog
mathog at caltech.edu
Mon Mar 19 13:58:12 PDT 2018
On one of our CentOS 6.9 systems with a PERC H730 controller I just
noticed that file system reads are quite slow. Like 30Mb/s slow. Anybody
care to hazard a guess what might be causing this situation? We have
another quite similar machine which is fast (A), compared to this one (B)
which is slow:
                  A        B
RAM             512      512   GB
CPUs             48       56   (via /proc/cpuinfo, actually this is threads)
Adapter       H710P     H730
RAID Level        *        *   Primary-5, Secondary-0, RAID Level Qualifier-3
Size          7.275    9.093   TB
state             *        *   Optimal
Drives            5        6
read rate       540       30   Mb/s (dd if=largefile bs=8192 of=/dev/null& ; iotop)
patrol           No       No   (megacli shows patrol read not going now)
sata disk ST2000NM0033
sas disk ST2000NM0023
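
For anyone who wants to repeat the read-rate measurement without dd and iotop, here is a minimal Python sketch of the same idea: read a file sequentially in 8192-byte blocks and report the rate. The function name `read_throughput` is made up for this example, and the result is only meaningful if the file is not already in the page cache (on a 512 GB machine a recently read file may well be), so treat it as a rough cross-check rather than a replacement for the numbers above.

```python
import sys
import time


def read_throughput(path, block_size=8192):
    """Read `path` sequentially; return (bytes_read, rate in 10^6 bytes/s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 so every read() asks the OS for block_size bytes,
    # roughly what dd does with bs=8192.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, rate


if __name__ == "__main__" and len(sys.argv) > 1:
    nbytes, rate = read_throughput(sys.argv[1])
    print(f"{nbytes} bytes read at {rate:.1f} MB/s")
```

Python adds some per-call overhead at this block size, so a fast array may show a lower figure here than dd reports.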
ulimit -a on both is:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2067196
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 60000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Nothing in the SMART values indicates a read problem, although on "B"
one disk is slowly accumulating events in the rereads/rewrites counter
for writes (it has 2346, accumulated at about 10 per week). The
rereads/rewrites counter for reads is 0 on that disk. For "B" the
smartctl output columns are:
            Errors Corrected by            Total  Correction      Gigabytes       Total
                 ECC         rereads/     errors   algorithm      processed uncorrected
              fast | delayed rewrites  corrected invocations   [10^9 bytes]      errors
read:    934353848         0        0  934353848           0      48544.026           0
read:   2017672022         0        0 2017672022           0      48574.489           0
read:   2605398517         3        0 2605398520           3      48516.951           0
read:   3237457411         1        0 3237457412           1      48501.302           0
read:   2028103953         0        0 2028103953           0      14438.132           0
read:    197018276         0        0  197018276           0      48640.023           0
write:           0         0        0          0           0      26394.472           0
write:           0         0     2346       2346        2346      26541.534           0
write:           0         0        0          0           0      27549.205           0
write:           0         0        0          0           0      25779.557           0
write:           0         0        0          0           0      11266.293           0
write:           0         0        0          0           0      26465.227           0
verify:  341863005         0        0  341863005           0     241374.368           0
verify:  866033815         0        0  866033815           0     223849.660           0
verify: 2925377128         0        0 2925377128           0     221697.809           0
verify: 1911833396         6        0 1911833402           6     228054.383           0
verify:  192670736         0        0  192670736           0      66322.573           0
verify: 1181681503         0        0 1181681503           0     222556.693           0
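
Picking the odd disk out of these counters by eye is tedious, so here is a small Python sketch that parses rows of this shape and flags any with a nonzero rereads/rewrites or uncorrected count. It assumes each row is a `read:`, `write:` or `verify:` label followed by exactly seven numeric fields, as in the listing above. The names `ErrorCounters`, `parse_counter_line` and `suspicious` are invented for this example and are not part of smartctl.

```python
from typing import NamedTuple, Optional


class ErrorCounters(NamedTuple):
    op: str                      # "read", "write" or "verify"
    ecc_fast: int
    ecc_delayed: int
    rereads_rewrites: int
    total_corrected: int
    algorithm_invocations: int
    gigabytes_processed: float   # in units of 10^9 bytes
    total_uncorrected: int


def parse_counter_line(line: str) -> Optional[ErrorCounters]:
    """Parse one counter row; return None for headers or anything else."""
    fields = line.split()
    if len(fields) != 8 or not fields[0].endswith(":"):
        return None
    op = fields[0][:-1]
    if op not in ("read", "write", "verify"):
        return None
    try:
        return ErrorCounters(
            op,
            int(fields[1]), int(fields[2]), int(fields[3]),
            int(fields[4]), int(fields[5]),
            float(fields[6]), int(fields[7]),
        )
    except ValueError:
        return None


def suspicious(rows):
    """Rows with any rereads/rewrites or uncorrected errors."""
    return [r for r in rows if r.rereads_rewrites > 0 or r.total_uncorrected > 0]
```

Fed the rows above, this flags only the second write row (2346 rewrites). The handful of delayed ECC corrections on the read and verify rows are not flagged under this rule.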
If the process doing the IO is root it doesn't go any faster.
Oddly if on "B" a second dd process is started on another file it ALSO
reads at 30Mb/s. So the disk system then does a total of 60Mb/s, but
only 30Mb/s per process. Added a 3rd and a 4th process doing the same.
At the 4th it seemed to hit some sort of limit, with each process now
consistently less than 30Mb/s and the total at maybe 80Mb/s. Hard
to say what the exact total was as it was jumping around like crazy. On
"A" 2 processes each got 270Mb/s, and 3 each got 180Mb/s. Didn't try 4.
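
The multi-reader test can be sketched in Python as well, in case the jumpy iotop totals are hard to read. This version reads several files at once and returns a rate per reader plus an aggregate over the wall-clock time. Two assumptions to keep in mind: it uses threads as a stand-in for the separate dd processes (file reads release the interpreter lock, but it is not the same thing), and the names `read_file` and `concurrent_read_rates` are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path, block_size=8192):
    """Read `path` sequentially; return (bytes_read, seconds_taken)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def concurrent_read_rates(paths, block_size=8192):
    """Read all `paths` at once; return (per-reader rates, aggregate rate).

    Rates are in units of 10^6 bytes per second.
    """
    if not paths:
        return [], 0.0
    wall_start = time.perf_counter()
    # One worker per file so all reads are in flight together.
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        results = list(pool.map(lambda p: read_file(p, block_size), paths))
    wall = time.perf_counter() - wall_start
    per_reader = [n / t / 1e6 if t > 0 else float("inf") for n, t in results]
    total_bytes = sum(n for n, _ in results)
    aggregate = total_bytes / wall / 1e6 if wall > 0 else float("inf")
    return per_reader, aggregate
```

Each path should be a different large file that is not already cached, otherwise the numbers reflect memory speed rather than the array.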
The only oddness of late on "B" is that a few days ago it loaded too
many memory-hungry processes, so the OS killed some. I have had that
happen before on other systems without them doing anything odd
afterwards.
Any ideas what this slowdown might be?
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech