[Beowulf] LSI Megaraid stalls system on very high IO?
Dimitris Zilaskos
dimitrisz at gmail.com
Fri Aug 15 09:58:27 PDT 2014
Hi,
I hope your issue has been resolved in the meantime. I had a somewhat
similar mixed experience with Dell-branded LSI controllers. It would
appear that some models are simply not fit for particular workloads. I
have put some information on our blog at
http://www.gridpp.rl.ac.uk/blog/2013/06/14/lsi-1068e-issues-understood-and-resolved/
Cheers,
Dimitris
On Thu, Jul 31, 2014 at 7:37 PM, mathog <mathog at caltech.edu> wrote:
> Any pointers on why a system might appear to "stall" under very high IO
> through an LSI MegaRAID adapter? (dm_raid45, on RHEL 5.10.)
>
> I have been working on another group's big Dell server, which has 16 CPUs,
> 82 GB of memory, and five 1 TB disks which go through an LSI MegaRAID (not
> sure of the exact configuration and their system admin is out sick) and show
> up as /dev/sd[abc], where the first two are just under 2 TB and the third is
> /boot and is about 133 GB. sda and sdb are then combined through LVM into
> one big volume, and that is what is mounted.
>
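Since the exact RAID configuration isn't known, something along these
lines might help pin it down (a rough sketch; it assumes MegaCli and the
LVM tools happen to be installed, and the binary may be named MegaCli64
instead):

  # virtual drive layout as the controller presents it
  MegaCli -LDInfo -Lall -aALL
  # how LVM stitches the virtual drives together
  pvs; vgs; lvs
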
> Yesterday on this system when I ran 14 copies of this simultaneously:
>
> # X is 0-13
> gunzip -c bigfile${X}.gz > resultfile${X}
>
> the first time, part way through, all of my terminals locked up for several
> minutes, and then recovered. Another similar command had the same issue
> about half an hour later, but others between and since did not stall. The
> size of each unpacked file is only about 0.5 GB, so even if the entire file
> was stored in memory in the pipes, all 14 should have fit in main memory.
> Nothing else was running (at least nothing that I noticed before or after;
> something might have started up during the run and ended before I could
> look for it).
> During this period the system would still answer pings. Nothing showed up
> in /var/log/messages or dmesg, "last" showed nobody else had logged in, and
> overnight runs of "smartctl -t long" on the 5 disks were clean - nothing
> pending, no reallocation events.
>
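Just in case it matters: if those tests were run against the /dev/sd*
virtual drives, the controller may have hidden the physical disks from
them. With smartmontools the individual drives behind a MegaRAID can
usually be reached via the megaraid passthrough, roughly like this
(the device IDs 0-4 are only a guess and may differ on your controller):

  # start long self-tests on the physical drives behind the controller
  for n in 0 1 2 3 4; do smartctl -d megaraid,$n -t long /dev/sda; done
  # read the results once the tests have finished
  for n in 0 1 2 3 4; do smartctl -d megaraid,$n -l selftest /dev/sda; done
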
> Today I ran the first set of commands again with "nice 10" and had "top"
> going; nothing untoward was observed and there were no stalls. On that run
> iostat showed:
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda            6034.00         0.00    529504.00          0     529504
> sda5           6034.00         0.00    529504.00          0     529504
> dm-0          68260.00      2056.00    546008.00       2056     546008
>
>
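If the stall comes back, it may be worth leaving a couple of loggers
running alongside the jobs so it gets caught in the act (a minimal
sketch; the log file names are arbitrary):

  # per-device utilisation and service times, sampled once a second
  iostat -dxk 1 > /tmp/iostat.log &
  # memory, swap and blocked-process counts, once a second
  vmstat 1 > /tmp/vmstat.log &
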
> So why the apparent stalls yesterday? It felt as if my interactive
> processes had either been swapped out or been given so much lower a
> priority than other processes that they were not getting any CPU time.
> Is there some sort of housekeeping that the MegaRAID, LVM, or anything
> normally installed with RHEL 5.10 might need to do from time to time
> that would account for these stalls?
>
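One thing I would check, purely a guess from the symptoms: with that
much RAM the default writeback thresholds let many gigabytes of dirty
pages pile up before the kernel flushes them, and the flush itself can
freeze interactive work for minutes. Something like this shows the
current settings and how much dirty data is queued while the jobs run:

  # percentage of RAM allowed to be dirty before throttling/flushing
  sysctl vm.dirty_ratio vm.dirty_background_ratio
  # amount of dirty data waiting to be written out, checked live
  grep -E 'Dirty|Writeback' /proc/meminfo
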
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf