[Beowulf] LSI Megaraid stalls system on very high IO?
Jörg Saßmannshausen
j.sassmannshausen at ucl.ac.uk
Tue Aug 19 15:08:16 PDT 2014
Hi Greg,
thanks for the email. I agree, I will be lucky to get such a machine.
What I will probably do is go for a modern motherboard and try to get a PCI-e
SCSI card. I hope at least those exist....
All the best from a cold London
Jörg
On Monday 18 August 2014 Gregory Matthews wrote:
> On 16/08/14 08:46, Jörg Saßmannshausen wrote:
> > My problem: I got some old PCI-X LSI SCSI cards which are connected to
> > some Infortrend storage boxes. We recently had a power-dip (lights went
> > off and came back within 2 sec) and now the 10 year old frontend is
> > playing up. So I need a new frontend, and it seems very difficult to get
> > a PCI-e to PCI-X riser card so I can use a newer motherboard with more
> > cores and more memory.
>
> good luck with that! Those technologies are pretty much incompatible.
> There are one or two PCIe (x1) to PCI converters (maybe compatible with
> PCI-X - check voltages etc.), but I wouldn't trust them with my storage.
>
> The last server we bought that was still compatible with PCI-X was a
> Dell PowerEdge R200; you needed to specify the PCI-X riser when buying.
> Maybe eBay is your best bet at this point?
>
> GREG
>
> > Hence the thread was good for me to read, as hopefully I can configure
> > the frontend a bit better.
> >
> > If somebody has any comments on my problem, feel free to reply.
> >
> > David: By the looks of it you compress and decompress large files on a
> > regular basis. Have you considered using pigz, the parallel version of
> > gzip? By default it uses all available cores, but you can change that on
> > the command line. That way you might avoid the disk I/O problem and
> > simply use the available cores. You could also run it under 'nice' to
> > make sure the machine does not become unresponsive due to high CPU
> > load. Just an idea to speed up your decompressions.
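> >
> > Something along these lines, assuming pigz is installed (the file name
> > is only a placeholder, and note that pigz mainly parallelises
> > compression; decompression remains largely single-threaded):
> >
> >   # compress with 8 threads instead of pigz's default of all cores,
> >   # at reduced CPU priority so the machine stays responsive
> >   nice -n 10 pigz -p 8 bigfile0
> >
> >   # decompress at the same reduced priority
> >   nice -n 10 pigz -d bigfile0.gz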
> >
> > All the best from a sunny London
> >
> > Jörg
> >
> > On Friday 15 August 2014 Dimitris Zilaskos wrote:
> >> Hi,
> >>
> >> I hope your issue has been resolved in the meantime. I had a somewhat
> >> similar mixed experience with Dell-branded LSI controllers. It would
> >> appear that some models are just not fit for particular workloads. I
> >> have put some information on our blog at
> >> http://www.gridpp.rl.ac.uk/blog/2013/06/14/lsi-1068e-issues-understood-and-resolved/
> >>
> >> Cheers,
> >>
> >> Dimitris
> >>
> >> On Thu, Jul 31, 2014 at 7:37 PM, mathog <mathog at caltech.edu> wrote:
> >>> Any pointers on why a system might appear to "stall" on very high IO
> >>> through an LSI megaraid adapter? (dm_raid45, on RHEL 5.10.)
> >>>
> >>> I have been working on another group's big Dell server, which has 16
> >>> CPUs, 82 GB of memory, and five 1 TB disks which go through an LSI
> >>> Megaraid (not sure of the exact configuration, and their system admin
> >>> is out sick). They show up as /dev/sd[abc], where the first two are
> >>> just under 2 TB and the third holds /boot and is about 133 GB. sda and
> >>> sdb are then combined through LVM into one big volume, and that is
> >>> what is mounted.
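> >>>
> >>> I should be able to confirm the exact layout myself with the standard
> >>> LVM query tools - generic commands, nothing box-specific assumed:
> >>>
> >>>   pvs; vgs; lvs   # physical volumes, volume groups, logical volumes
> >>>   fdisk -l        # partition tables; RHEL 5 predates lsblk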
> >>>
> >>> Yesterday on this system, when I ran 14 copies of this simultaneously:
> >>>
> >>>   # launch all 14 decompressions in parallel, X = 0..13
> >>>   for X in $(seq 0 13); do
> >>>     gunzip -c bigfile${X}.gz > resultfile${X} &
> >>>   done
> >>>
> >>> the first time, part way through, all of my terminals locked up for
> >>> several minutes, and then recovered. Another similar command had the
> >>> same issue about half an hour later, but others between and since did
> >>> not stall. Each file unpacks to only about 0.5 GB, so even if the
> >>> entire output were buffered in memory, all 14 should have fit in main
> >>> memory. Nothing else was running (at least nothing that I noticed
> >>> before or after; something might have started during the run and ended
> >>> before I could look for it). During this period the system would
> >>> still answer pings. Nothing showed up in /var/log/messages or dmesg,
> >>> "last" showed nobody else had logged in, and overnight runs of
> >>> "smartctl -t long" on the 5 disks were clean - nothing pending, no
> >>> reallocation events.
> >>>
> >>> Today I ran the first set of commands again with "nice 10", kept
> >>> "top" running, and nothing untoward was observed; there were no
> >>> stalls. On that run iostat showed:
> >>>
> >>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> >>> sda            6034.00         0.00    529504.00          0     529504
> >>> sda5           6034.00         0.00    529504.00          0     529504
> >>> dm-0          68260.00      2056.00    546008.00       2056     546008
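> >>>
> >>> Next time it stalls I might leave the extended report running instead
> >>> (the interval is just a guess):
> >>>
> >>>   iostat -x 5   # adds await, svctm and %util per device every 5 s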
> >>>
> >>>
> >>> So why the apparent stalls yesterday? It felt like my interactive
> >>> processes were either swapped out or outprioritised by enough other
> >>> processes that they were not getting any CPU time. Is there some sort
> >>> of housekeeping that the Megaraid, LVM, or anything normally installed
> >>> with RHEL 5.10 might need to do from time to time that would account
> >>> for these stalls?
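> >>>
> >>> Next time I will also keep vmstat going in a spare terminal, which
> >>> should separate those possibilities - swapping, CPU starvation and
> >>> processes blocked on I/O:
> >>>
> >>>   # r = runnable, b = uninterruptible sleep (usually I/O),
> >>>   # si/so = swap traffic, wa = iowait
> >>>   vmstat 5
> >>>
> >>> A large "b" with si/so near zero during a stall would point at I/O
> >>> writeback rather than swapping.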
> >>>
> >>> Thanks,
> >>>
> >>> David Mathog
> >>> mathog at caltech.edu
> >>> Manager, Sequence Analysis Facility, Biology Division, Caltech
--
*************************************************************
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ
email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net
Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html