[Beowulf] LSI Megaraid stalls system on very high IO?
Jörg Saßmannshausen
j.sassmannshausen at ucl.ac.uk
Tue Aug 19 15:08:16 PDT 2014
Hi Greg,
thanks for the email. I agree, I will be lucky to get such a machine.
What I will probably do is go for a modern motherboard and try to get a PCI-e
SCSI card. I hope at least those exist....
All the best from a cold London
Jörg
On Monday 18 August 2014 Gregory Matthews wrote:
> On 16/08/14 08:46, Jörg Saßmannshausen wrote:
> > My problem: I got some old PCI-X LSI SCSI cards which are connected to
> > some Infortrend storage boxes. We recently had a power-dip (lights went
> > off and came back within 2 sec) and now the 10 year old frontend is
> > playing up. So I need a new frontend, and it seems very difficult to get
> > a PCI-e to PCI-X riser card so I can use a newer motherboard with more
> > cores and more memory.
>
> good luck with that! Those technologies are pretty much incompatible.
> There are one or two PCIe (x1) to PCI converters (maybe compatible with
> PCI-X - check voltages etc.), but I wouldn't trust them with my storage.
>
> The last server we bought that was still compatible with PCI-X was a
> Dell PowerEdge R200; you needed to specify the PCI-X riser when buying.
> Maybe eBay is your best bet at this point?
>
> GREG
>
> > Hence the thread was good for me to read, as hopefully I can configure
> > the frontend a bit better.
> >
> > If somebody has any comments on my problem, feel free to reply.
> >
> > David: By the looks of it you compress and decompress large files on a
> > regular basis. Have you considered using pigz, the parallel version of
> > gzip? By default it uses all available cores, but you can change that on
> > the command line. That way you might avoid the disk I/O problem and
> > simply use the available cores. You could also run it under 'nice' to
> > make sure the machine does not become unresponsive due to high CPU
> > load. Just an idea to speed up your decompressions.
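> >
> > Something along these lines, assuming pigz is installed (the file name
> > is only a placeholder, and note that pigz mainly parallelises
> > compression; decompression remains largely single-threaded):
> >
> >   # compress with 8 threads instead of pigz's default of all cores,
> >   # at reduced CPU priority so the machine stays responsive
> >   nice -n 10 pigz -p 8 bigfile0
> >
> >   # decompress at the same reduced priority
> >   nice -n 10 pigz -d bigfile0.gz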
> >
> > All the best from a sunny London
> >
> > Jörg
> >
> > On Friday 15 August 2014 Dimitris Zilaskos wrote:
> >> Hi,
> >>
> >> I hope your issue has been resolved in the meantime. I had a somewhat
> >> similar mixed experience with Dell-branded LSI controllers. It would
> >> appear that some models are just not fit for particular workloads. I
> >> have put some information on our blog at
> >> http://www.gridpp.rl.ac.uk/blog/2013/06/14/lsi-1068e-issues-understood-and-resolved/
> >>
> >> Cheers,
> >>
> >> Dimitris
> >>
> >> On Thu, Jul 31, 2014 at 7:37 PM, mathog <mathog at caltech.edu> wrote:
> >>> Any pointers on why a system might appear to "stall" on very high IO
> >>> through an LSI megaraid adapter? (dm_raid45, on RHEL 5.10.)
> >>>
> >>> I have been working on another group's big Dell server, which has 16
> >>> CPUs, 82 GB of memory, and five 1 TB disks which go through an LSI
> >>> Megaraid (not sure of the exact configuration, and their system admin
> >>> is out sick). They show up as /dev/sd[abc], where the first two are
> >>> just under 2 TB and the third holds /boot and is about 133 GB. sda and
> >>> sdb are then combined through LVM into one big volume, and that is
> >>> what is mounted.
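> >>>
> >>> I should be able to confirm the exact layout myself with the standard
> >>> LVM query tools - generic commands, nothing box-specific assumed:
> >>>
> >>>   pvs; vgs; lvs   # physical volumes, volume groups, logical volumes
> >>>   fdisk -l        # partition tables; RHEL 5 predates lsblk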
> >>>
> >>> Yesterday on this system, when I ran 14 copies of this simultaneously:
> >>>
> >>>   # launch all 14 decompressions in parallel, X = 0..13
> >>>   for X in $(seq 0 13); do
> >>>     gunzip -c bigfile${X}.gz > resultfile${X} &
> >>>   done
> >>>
> >>> the first time, part way through, all of my terminals locked up for
> >>> several minutes, and then recovered. Another similar command had the
> >>> same issue about half an hour later, but others between and since did
> >>> not stall. Each file unpacks to only about 0.5 GB, so even if the
> >>> entire output were buffered in memory, all 14 should have fit in main
> >>> memory. Nothing else was running (at least nothing that I noticed
> >>> before or after; something might have started during the run and ended
> >>> before I could look for it). During this period the system would
> >>> still answer pings. Nothing showed up in /var/log/messages or dmesg,
> >>> "last" showed nobody else had logged in, and overnight runs of
> >>> "smartctl -t long" on the 5 disks were clean - nothing pending, no
> >>> reallocation events.
> >>>
> >>> Today I ran the first set of commands again with "nice 10", kept
> >>> "top" running, and nothing untoward was observed; there were no
> >>> stalls. On that run iostat showed:
> >>>
> >>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> >>> sda            6034.00         0.00    529504.00          0     529504
> >>> sda5           6034.00         0.00    529504.00          0     529504
> >>> dm-0          68260.00      2056.00    546008.00       2056     546008
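> >>>
> >>> Next time it stalls I might leave the extended report running instead
> >>> (the interval is just a guess):
> >>>
> >>>   iostat -x 5   # adds await, svctm and %util per device every 5 s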
> >>>
> >>>
> >>> So why the apparent stalls yesterday? It felt like my interactive
> >>> processes were either swapped out or outprioritised by enough other
> >>> processes that they were not getting any CPU time. Is there some sort
> >>> of housekeeping that the Megaraid, LVM, or anything normally installed
> >>> with RHEL 5.10 might need to do from time to time that would account
> >>> for these stalls?
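> >>>
> >>> Next time I will also keep vmstat going in a spare terminal, which
> >>> should separate those possibilities - swapping, CPU starvation and
> >>> processes blocked on I/O:
> >>>
> >>>   # r = runnable, b = uninterruptible sleep (usually I/O),
> >>>   # si/so = swap traffic, wa = iowait
> >>>   vmstat 5
> >>>
> >>> A large "b" with si/so near zero during a stall would point at I/O
> >>> writeback rather than swapping.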
> >>>
> >>> Thanks,
> >>>
> >>> David Mathog
> >>> mathog at caltech.edu
> >>> Manager, Sequence Analysis Facility, Biology Division, Caltech
--
*************************************************************
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ
email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net
Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html