[Beowulf] LSI Megaraid stalls system on very high IO?

Jörg Saßmannshausen j.sassmannshausen at ucl.ac.uk
Sat Aug 16 00:46:23 PDT 2014


Hi all

thanks for the thread, which was very timely for me. Special thanks to 
Dimitris for your contribution.

My problem: I have some old PCI-X LSI SCSI cards which are connected to some 
Infortrend storage boxes. We recently had a power dip (the lights went off and 
came back within 2 seconds) and now the 10-year-old frontend is playing up. So 
I need a new frontend, and it seems very difficult to get a PCI-e to PCI-X 
riser card that would let me use a newer motherboard with more cores and more 
memory.

Hence the thread was good for me to read, as I can hopefully configure the 
new frontend a bit better.

If anybody has any comments on my problem, feel free to reply.

David: By the looks of it you will be compressing larger files on a regular 
basis. Have you considered using pigz, the parallel version of gzip? By 
default it uses all available cores, but you can change that on the command 
line. That way you might avoid the disc I/O problem and simply use the 
available cores. You could also 'nice' the processes to make sure the machine 
does not become unresponsive due to high CPU load. Just an idea to speed up 
your decompressions.
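
A minimal sketch of what I mean, assuming pigz is installed and borrowing the 
bigfile${X}.gz / resultfile${X} names from your mail. One caveat: pigz can 
only parallelise compression; 'pigz -d' inflates on a single thread (plus 
separate read/write/check threads), so for decompression the bigger win is 
running the 14 jobs concurrently but nice'd:

  # run all 14 decompressions at low priority so the box stays responsive;
  # -d decompresses, -c writes to stdout, -p caps the number of threads
  for X in $(seq 0 13); do
      nice -n 10 pigz -d -p 4 -c bigfile${X}.gz > resultfile${X} &
  done
  wait   # block until all background jobs have finished

For compression the -p flag matters much more, as by default pigz will 
happily saturate every core on the machine.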

All the best from a sunny London

Jörg


On Friday 15 August 2014 Dimitris Zilaskos wrote:
> Hi,
> 
> I hope your issue has been resolved in the meantime. I had a somewhat
> similar mixed experience with Dell-branded LSI controllers. It would
> appear that some models are just not fit for particular workloads. I have
> put some information on our blog at
> http://www.gridpp.rl.ac.uk/blog/2013/06/14/lsi-1068e-issues-understood-and-resolved/
> 
> Cheers,
> 
> Dimitris
> 
> On Thu, Jul 31, 2014 at 7:37 PM, mathog <mathog at caltech.edu> wrote:
> > Any pointers on why a system might appear to "stall" on very high IO
> > through an LSI megaraid adapter?  (dm_raid45, on RHEL 5.10.)
> > 
> > I have been working on another group's big Dell server, which has 16
> > CPUs, 82 GB of memory, and five 1 TB disks which go through an LSI
> > Megaraid (not sure of the exact configuration, and their system admin is
> > out sick) and show up as /dev/sd[abc], where the first two are just under
> > 2 TB and the third holds /boot and is about 133 GB.  sda and sdb are then
> > combined through LVM into one big volume and that is what is mounted.
> > 
> > Yesterday on this system when I ran 14 copies of this simultaneously:
> >   # X is 0-13
> >   gunzip -c bigfile${X}.gz > resultfile${X}
> > 
> > the first time, part way through, all of my terminals locked up for
> > several minutes, and then recovered.  Another similar command had the
> > same issue about half an hour later, but others between and since did
> > not stall.  The size of the files unpacked is only about 0.5 GB, so even
> > if the entire file was stored in memory in the pipes, all 14 should have
> > fit in main memory. Nothing else was running (at least nothing that I
> > noticed before or after; something might have started up during the run
> > and ended before I could look for it). During this period the system
> > would still answer pings.  Nothing showed up in /var/log/messages or
> > dmesg, "last" showed nobody else had logged in, and overnight runs of
> > "smartctl -t long" on the five disks were clean: nothing pending, no
> > reallocation events.
> > 
> > Today I ran the first set of commands again with "nice 10" and had "top"
> > running; nothing untoward was observed and there were no stalls. On
> > that run iostat showed:
> > 
> > Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sda            6034.00         0.00    529504.00          0     529504
> > sda5           6034.00         0.00    529504.00          0     529504
> > dm-0          68260.00      2056.00    546008.00       2056     546008
> > 
> > 
> > So why the apparent stalls yesterday?  It felt as if my interactive
> > processes had either been swapped out or been given so much lower a
> > priority than other processes that they were not getting any CPU time.
> > Is there some sort of housekeeping that the Megaraid, LVM, or anything
> > normally installed with RHEL 5.10 might need to do from time to time
> > that would account for these stalls?
> > 
> > Thanks,
> > 
> > David Mathog
> > mathog at caltech.edu
> > Manager, Sequence Analysis Facility, Biology Division, Caltech
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
> 


-- 
*************************************************************
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ 

email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html

