From Michael.Frese at NumerEx-LLC.com Wed Jun 4 15:52:14 2008 From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] OFED/IB for FC8 Message-ID: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com> Following Jeff Layton's post to this list [Cheap SDR IB] on January 28, we purchased 8 Infinihost LX's and an 8 port switch, and began trying to get the OpenFabrics (OFED) release of MVAPICH for Fedora Core 6 to run on our new machines. We develop and run a multiphysics code in a relatively fine grain parallel mode where latency dominates the performance scaling, so it seemed like a good thing to try. This is our first exposure to InfiniBand, though we have considerable experience with MPI, both in-memory and over GigE, including using netpipe to measure latency and bandwidth. Those machines have AMD Athlon X2 6000+'s on Asus M2N-SLI Deluxe motherboards with an open PCI Express slot that will handle x4. The main issue is that we are presently running Fedora Core 8 and the 2.6.21 SMP kernel, but there is no OFED release for FC8 yet. Is anyone else working on this? Has anyone succeeded at getting it to work? We started with OFED version 1.2.5 from http://www.openfabrics.org/downloads/OFED/ofed-1.2.5/OFED-1.2.5-RPMS/ We downloaded all the rpms from redhat-release-4AS-6.1 version. In particular the kernel rpms are kernel-ib-devel-1.2-2.6.9_55.ELsmp and kernel-ib-1.2-2.6.9_55.ELsmp. We used the 1.2.5 version because there don't seem to be any rpms for the 1.3 version. All the OFED rpm's for FC6 installed on FC8 without difficulty, except for opensm-3.0.3-0.ppc64.rpm It didn't say "missing dependencies ..." It just got stuck. We had to kill the 'rpm -ivh', remove the lock file and rebuild the rpm database. After that, # lsmod | grep ib shows about 15 IB related kernel mods. Even so, at this point, some of the IB stuff works. 
We can run ibnetdiscover and see the HCA's on the two machines that have the rpm's installed, and the switch, too. We could use that to make a topology file, but we don't know where to put it, or even if we should put it somewhere. We can run ibchecknet, and though it finds 4 nodes, it says they are all bad. It also reports "lid 0 address resolution: FAILED". We have not succeeded in getting ibping to work, and aren't really sure how to specify the remote address for it.

We found

/usr/share/doc/ofed-docs-1.2/README.txt
/usr/share/doc/ofed-docs-1.2/OFED_Installation_Guide.txt

and, as described there, did

# /etc/init.d/openibd start
Loading QLogic InfiniPath driver: [FAILED]
Loading HCA driver and Access Layer: [ OK ]
Setting up InfiniBand network interfaces:
Failed to configure IPoIB connected mode for ib0
Bringing up interface ib0: [FAILED]
Setting up service network . . . [ done ]
Loading ib_sdp [FAILED]
Loading ib_vnic [FAILED]
Module ib_vnic not loaded.
Bringing up VNIC interfaces [FAILED]

That mostly looks bad.

Does anyone have any suggestions?

We are willing to try a build from source, but we are unsure of what challenges might lie down that path.

We'd rather not fall back to FC6, but we may have to do that.

Thanks for your help.

Mike Frese

From lindahl at pbm.com Wed Jun 4 17:15:49 2008
From: lindahl at pbm.com (Greg Lindahl)
Date: Thu Aug 28 01:07:08 2008
Subject: [Beowulf] OFED/IB for FC8
In-Reply-To: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com>
References: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com>
Message-ID: <20080605001549.GE27430@bx9.net>

> All the OFED rpm's for FC6 installed on FC8 without difficulty,
> except for opensm-3.0.3-0.ppc64.rpm

This is the cause of most of your subsequent problems.
Without an SM running somewhere on your network, the links don't come fully up. There are mailing lists devoted to OFED that you could ask on. Building the software from scratch is probably the most straight-forward way to get something that works. -- greg From john.hearns at streamline-computing.com Wed Jun 4 22:36:44 2008 From: john.hearns at streamline-computing.com (John Hearns) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] OFED/IB for FC8 In-Reply-To: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com> References: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com> Message-ID: <1212644214.7657.6.camel@Vigor13> On Wed, 2008-06-04 at 16:52 -0600, Michael H. Frese wrote: > Does anyone have any suggestions? > > We are willing to try a build from source, but we are unsure of what > challenges might lie down that path. > I agree with Greg. Build it from source - that way you will have the latest version (1.3), you will learn about the software stack whilst doing it and you will know which switches were used during the configuration process. Depending solely on distribution supplied RPMs for any HPC type software is a bad move IMHO - as you've just seen it might not have a feature you need, or might not install exactly the way you want it. And you'll always have the version supplied by the distribution, and won't be able to update if you hit a bug or need a new feature. Remember, we're in the era of open source. That's why you chose to use Linux - you have control. John Hearns From mfatica at gmail.com Wed Jun 4 16:27:40 2008 From: mfatica at gmail.com (Massimiliano Fatica) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] OFED/IB for FC8 In-Reply-To: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com> References: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com> Message-ID: <8e6393ac0806041627k1a0f58c0r319a603154d33068@mail.gmail.com> We have the same switch. I was able to get it to work with the latest OFED 1.3 ( available from the Mellanox web site). 
They have rpms for RHEL4 and RHEL5.

Massimiliano

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20080604/7b66ecc1/attachment.html

From rainer at lfbs.RWTH-Aachen.DE Thu Jun 5 02:38:24 2008
From: rainer at lfbs.RWTH-Aachen.DE (Rainer Finocchiaro)
Date: Thu Aug 28 01:07:08 2008
Subject: [Beowulf] OFED/IB for FC8
In-Reply-To: <20080605001549.GE27430@bx9.net>
References: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com> <20080605001549.GE27430@bx9.net>
Message-ID: <4847B410.1070202@lfbs.rwth-aachen.de>

Hi Michael,

Greg Lindahl wrote:
>> All the OFED rpm's for FC6 installed on FC8 without difficulty,
>> except for opensm-3.0.3-0.ppc64.rpm
>
> This is the cause of most of your subsequent problems. Without an SM
> running somewhere on your network, the links don't come fully up.
>
> There are mailing lists devoted to OFED that you could ask
> on. Building the software from scratch is probably the most
> straight-forward way to get something that works.
>
> -- greg

I completely agree with Greg. I will add some comments, as I have just installed the distribution (the hard way: under Debian).

Following your link, I reach a download directory offering only ppc64 RPMs; in fact all precompiled RPMs for OFED-1.2.5 are for PowerPC and not for x86. You could try OFED-1.2, where all the precompiled RPMs are for x86_64, which should be suitable for your processor, though that actually depends on the type of distribution you installed (32-bit vs. 64-bit).

Much better is to download the more up-to-date OFED-1.3 sources. The package includes an install script, which builds and installs the RPMs for you. So you don't have to "fear" installing something that is not controlled by your package management system (RPM).

Regards,
Rainer

From kus at free.net Thu Jun 5 08:42:22 2008
From: kus at free.net (Mikhail Kuzminsky)
Date: Thu Aug 28 01:07:08 2008
Subject: [Beowulf] Barcelona hardware error: how to detect
Message-ID:

How is it possible to detect whether a particular AMD Barcelona CPU has, or doesn't have, the known hardware error?
To be more exact, Rev. B2 of Opteron 2350 - is it the CPU stepping with the error or without the error?

Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Inst. of Organic Chemistry
Moscow

From hahn at mcmaster.ca Thu Jun 5 08:57:28 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu Aug 28 01:07:08 2008
Subject: [Beowulf] Barcelona hardware error: how to detect
In-Reply-To:
References:
Message-ID:

> To be more exact, Rev. B2 of Opteron 2350 - is it for CPU stepping w/error or
> w/o error ?

AMD, like Intel, does a reasonable job of disclosing such info:

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF

the well-known problem is erratum 298, I think, and fixed in B3.

From kus at free.net Thu Jun 5 09:39:02 2008
From: kus at free.net (Mikhail Kuzminsky)
Date: Thu Aug 28 01:07:08 2008
Subject: [Beowulf] Barcelona hardware error: how to detect
In-Reply-To:
Message-ID:

In message from Mark Hahn (Thu, 5 Jun 2008 11:57:28 -0400 (EDT)):
>> To be more exact, Rev. B2 of Opteron 2350 - is it for CPU stepping
>> w/error or w/o error ?
>
> AMD, like Intel, does a reasonable job of disclosing such info:
>
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF
>
> the well-known problem is erratum 298, I think, and fixed in B3.

Yes, this AMD errata document says that in the B3 revision the error "will be fixed". I heard that new CPUs without the TLB+L3 error are shipping now, but are these CPUs really B3, or might they be some newer revision?

Mikhail

From hahn at mcmaster.ca Thu Jun 5 10:30:57 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu Aug 28 01:07:08 2008
Subject: [Beowulf] Barcelona hardware error: how to detect
In-Reply-To:
References:
Message-ID:

>> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF
>>
>> the well-known problem is erratum 298, I think, and fixed in B3.
>
> Yes, this AMD errata document says that in B3 revision the error "will be
> fixed".
I believe the absence of 'x' in the B3 column of the table on p 15 means that it _is_ fixed in B3. From kus at free.net Thu Jun 5 10:48:32 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: Message-ID: In message from Mark Hahn (Thu, 5 Jun 2008 13:30:57 -0400 (EDT)): >>> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF >>> >>> the well-known problem is erattum 298, I think, and fixed in B3. >> >> Yes, this AMD errata document says that in B3 revision the error >>"will be >> fixed". > >I believe the absence of 'x' in the B3 column of the table on p 15 >means that it _is_ fixed in B3. I received just now some preliminary data about Gaussian-03 run problems w/B2 and about absence of this problems w/B3. Yours Mikhail From kus at free.net Thu Jun 5 11:09:58 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: Message-ID: In message from Mark Hahn (Thu, 5 Jun 2008 13:55:01 -0400 (EDT)): >>> I believe the absence of 'x' in the B3 column of the table on p 15 >>> means that it _is_ fixed in B3. >> >> I received just now some preliminary data about Gaussian-03 run >>problems w/B2 >> and about absence of this problems w/B3. > >I'm mystified by this: B2 was broken, so using it without the bios >workaround is just a mistake or masochism. the workaround _did_ >apparently have performance implications, but that's why B3 exists... > >do you mean you know of G03 problems on B2 systems which are operating >_with_ the workaround? I don't know exactly, but I think the crash was under absence of workaround, because I was not informed that there was some kernel patches or BIOS changes. This was interesting for me also, because I have no information how this hardware problem may be affected in the "real life". 
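For anyone wanting to recognise the bug in practice: Greg Lindahl's reply in this thread quotes a machine-check signature for it (MC4 status register, MSR 0000_0410, equal to B2000000_000B0C0F or BA000000_000B0C0F, with the MC4 address register, MSR 0000_0412, equal to 26h). A minimal sketch of checking a logged status/address pair against that signature might look like the following. The helper names, and the idea of feeding it strings copied from a machine-check log, are assumptions for illustration only; this is not an AMD or kernel tool.

```python
# Hypothetical helper: compare a logged MC4 status/address pair with the
# signature quoted in this thread for the Barcelona L3 protocol error.
ERRATUM_MC4_STATUS = {0xB2000000000B0C0F, 0xBA000000000B0C0F}  # MSR 0000_0410
ERRATUM_MC4_ADDR = 0x26                                        # MSR 0000_0412


def parse_msr(text):
    """Parse a hex value written like 'B2000000_000B0C0F', '0x26' or '26h'."""
    cleaned = text.strip().replace("_", "").lower()
    if cleaned.startswith("0x"):
        cleaned = cleaned[2:]
    if cleaned.endswith("h"):
        cleaned = cleaned[:-1]
    return int(cleaned, 16)


def matches_l3_signature(mc4_status, mc4_addr):
    """Return True if both registers match the values quoted in the thread."""
    return (parse_msr(mc4_status) in ERRATUM_MC4_STATUS
            and parse_msr(mc4_addr) == ERRATUM_MC4_ADDR)


if __name__ == "__main__":
    print(matches_l3_signature("B2000000_000B0C0F", "26h"))
    print(matches_l3_signature("BA000000_000B0C0F", "0x26"))
    print(matches_l3_signature("B2000000_000B0C0F", "0x27"))
```

A match only says the logged event looks like the one described in the thread; it does not by itself tell you the stepping of the CPU.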
Mikhail From kus at free.net Thu Jun 5 11:22:36 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: <588c11220806051116i37ff7aa1oec16a85a24009592@mail.gmail.com> Message-ID: In message from "Jason Clinton" (Thu, 5 Jun 2008 13:16:33 -0500): >On Thu, Jun 5, 2008 at 1:09 PM, Mikhail Kuzminsky >wrote: > >> In message from Mark Hahn (Thu, 5 Jun 2008 >>13:55:01 >> -0400 (EDT)): >> >>> I'm mystified by this: B2 was broken, so using it without the bios >>> workaround is just a mistake or masochism. the workaround _did_ >>>apparently >>> have performance implications, but that's why B3 exists... >>> >>> do you mean you know of G03 problems on B2 systems which are >>>operating >>> _with_ the workaround? >>> >> >> I don't know exactly, but I think the crash was under absence of >> workaround, because I was not informed that there was some kernel >>patches or >> BIOS changes. This was interesting for me also, because I have no >> information how this hardware problem may be affected in the "real >>life". >> Mikhail >> > >The B2 BIOS work-around is to disable the L3 cache which gives you a >10-20% >performance hit with no reduction in power consumption. > >The kernel patch is very extensive and, last I heard, under NDA. AMD >has >said publicly that the patch gives you a 1-2% performance hit. This URL is old, but may give some information: https://www.x86-64.org/pipermail/discuss/2007-December/010260.html Mikhail From lindahl at pbm.com Thu Jun 5 11:30:20 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: References: Message-ID: <20080605183020.GA11661@bx9.net> On Thu, Jun 05, 2008 at 10:09:58PM +0400, Mikhail Kuzminsky wrote: > This was interesting for me also, because I > have no information how this hardware problem may be affected in the > "real life". 
I have 4 chips with the bug, in 2 servers. I see about 1 lockup per month with my workload, which doesn't include any VMs. (VMs are reputed to trigger the bug quickly.) I found a webpage with the details, and indeed this is what I see: | The system may experience a machine check event reporting an L3 | protocol error has occurred. In this case, the MC4 status register | (MSR 0000_0410) will be equal to B2000000_000B0C0F or | BA000000_000B0C0F. The MC4 address register (MSR 0000_0412) will be | equal to 26h.' -- greg From hahn at mcmaster.ca Thu Jun 5 11:43:05 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: <588c11220806051116i37ff7aa1oec16a85a24009592@mail.gmail.com> References: <588c11220806051116i37ff7aa1oec16a85a24009592@mail.gmail.com> Message-ID: > The kernel patch is very extensive and, last I heard, under NDA. AMD has the kernel patch was publicly distributed in dec 07. it appears to add some kernel logic to avoid the specific L3 TLB states which don't behave correctly. the bios-level workaround is different, and appears to disable the L3 TLB - I don't know whether that actually disables the L3 itself... while extremely unfortunate - so much that it clearly threatens the viability of the company, I think AMD responded reasonably. From csamuel at vpac.org Fri Jun 6 04:31:45 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: <971758047.108911212751290783.JavaMail.root@zimbra.vpac.org> Message-ID: <937389196.108931212751905042.JavaMail.root@zimbra.vpac.org> ----- "Mark Hahn" wrote: > the kernel patch was publicly distributed in dec 07. > it appears to add some kernel logic to avoid the specific > L3 TLB states which don't behave correctly. 
the bios-level > workaround is different, and appears to disable the L3 TLB - > I don't know whether that actually disables the L3 itself... I believe the patch re-enables the L3 cache and then works around the problem in software. When we were running B2 Barcelonas with this patch we didn't hit the errata and didn't see the performance penalty we would have expected if the L3 was disabled. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Fri Jun 6 04:33:38 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: Message-ID: <741026114.108961212752018697.JavaMail.root@zimbra.vpac.org> ----- "Mikhail Kuzminsky" wrote: > Yes, this AMD errata document says that in B3 revision the error "will > be fixed". I heard that new CPUs w/o TLB+L3 error are shipped now, > but are this CPUs really B3 or may be have some more new release ? They certainly do exist, we've got 94 nodes of them here and no longer require the kernel patch to work around the errata. -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From gerry.creager at tamu.edu Fri Jun 6 08:39:47 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments Message-ID: <48495A43.4060809@tamu.edu> We recently purchased a set of hardware for a cluster from a hardware vendor. We've encountered a couple of interesting issues with bringing the thing up that I'd like to get group comments on. 
Note that the RFP and negotiations specified this system was for a cluster installation, so there would be no misunderstanding... 1. We specified "No OS" in the purchase so that we could install CentOS as our base. We got a set of systems with a stub OS, and an EULA for the diagnostics embedded on the disk. After clicking thru the EULA, it tells us we have no OS on the disk, but does not fail to PXE. 2. BIOS had a couple of interesting defaults, including warn on keyboard error (Keyboard? Not intentionally. This is a compute node, and should never require a keyboard. Ever.) We also find the BIOS is set to boot from hard disk THEN PXE. But due to item 1, above, we never can fail over to PXE unless we load up a keyboard and monitor, and hit F12 to drop to PXE. In discussions with our sales rep, I'm told that we'd have had to pay extra to get a real bare hard disk, and that, for a fee, they'd have been willing to custom-configure the BIOS. OK, with the BIOS this isn't too unreasonable: They have a standard BIOS for all systems and if you want something special, paying for it's the norm... But, still, this is a CLUSTER installation we were quoted, not a desktop. Also, I'm now told that "almost every customer" ordered their cluster configuration service at several kilobucks per rack. Since the team I'm working with has some degree of experience in configuring and installing hardware and software on computational clusters, now measured in at least 10 separate cluster installations, this seemed like an unnecessary expense. However, we're finding vendor gotchas that are annoying at the least, and sometimes cause significant work-around time/effort. Finally, our sales guy yesterday was somewhat baffled as to why we'd ordered without OS, and further why we were using Linux over Windows for HPC. 
Not trying to revive the recent rant-fest about Windows HPC capabilities, can anyone cite real HPC applications generally run on significant clusters (I'll accept Cornell's work, although I remain personally convinced that the bulk of their Windows HPC work has been dedicated to maintaining grant funding rather than doing real work)?

No, I won't identify the vendor.
--
Gerry Creager -- gerry.creager@tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843

From dag at sonsorol.org Fri Jun 6 09:15:24 2008
From: dag at sonsorol.org (Chris Dagdigian)
Date: Thu Aug 28 01:07:08 2008
Subject: [Beowulf] A couple of interesting comments
In-Reply-To: <48495A43.4060809@tamu.edu>
References: <48495A43.4060809@tamu.edu>
Message-ID: <13F1DDCE-2FE6-4916-B1A5-995237EA80B6@sonsorol.org>

Bad job hiding the (obvious) vendor ;)

I'm riding the bus back home to Boston after a cluster building gig and your experience exactly matches what I encountered when I walked into the datacenter to start work on a pile of Dell 1950 servers.

I'll do you one better - 4 nodes out of our "homogeneous" cluster had reversed drive cabling which broke our imaging system as we had specific data to place on 2 drives of differing capacity.

Regards,
Chris

/* Sent via phone - apologies for typos & terseness */

From john.hearns at streamline-computing.com Fri Jun 6 09:43:32 2008
From: john.hearns at streamline-computing.com (John Hearns)
Date: Thu Aug 28 01:07:08 2008
Subject: [Beowulf] A couple of interesting comments
In-Reply-To: <48495A43.4060809@tamu.edu>
References: <48495A43.4060809@tamu.edu>
Message-ID: <1212770622.9679.5.camel@Vigor13>

On Fri, 2008-06-06 at 10:39 -0500, Gerry Creager wrote:
> 1. We specified "No OS" in the purchase so that we could install CentOS
> as our base. We got a set of systems with a stub OS, and an EULA for
> the diagnostics embedded on the disk. After clicking thru the EULA, it
> tells us we have no OS on the disk, but does not fail to PXE.

That sounds normal to me - that's the state we get servers in.

> 2. BIOS had a couple of interesting defaults, including warn on
> keyboard error (Keyboard? Not intentionally. This is a compute node,
> and should never require a keyboard. Ever.) We also find the BIOS is
> set to boot from hard disk THEN PXE. But due to item 1, above, we never
> can fail over to PXE unless we load up a keyboard and monitor, and hit
> F12 to drop to PXE.

The "warn on keyboard error" is a bit of a shocker - I haven't seen that one for ages. But the rest sounds normal, and yes, getting the keyboard and monitor out is normal. We get technicians to set the BIOSes on all compute nodes prior to delivery.

I just don't see why, though, the BIOSes from these major vendors are ALWAYS hard disk first, then PXE. A default setting t'other way around would make much more sense.
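The boot-order complaint above comes down to simple fall-through logic: the firmware tries each device in order and stops at the first one that boots. A toy sketch, assuming a deliberately simplified model of the BIOS (the device names and the `first_boot_device` function are illustrative, not any vendor's firmware interface), shows why a stub OS on disk combined with "disk then PXE" never reaches PXE, while "PXE then disk" does:

```python
# Toy model of BIOS boot fall-through; all names here are illustrative only.

def first_boot_device(boot_order, bootable):
    """Return the first device in boot_order that offers something to boot.

    boot_order: list of device names in the order the firmware tries them.
    bootable:   dict mapping device name -> True if that device can boot.
    """
    for device in boot_order:
        if bootable.get(device, False):
            return device
    return None  # nothing bootable was found on any device


if __name__ == "__main__":
    # The stub OS makes the disk look bootable; a PXE server is also present.
    as_delivered = {"disk": True, "pxe": True}
    print(first_boot_device(["disk", "pxe"], as_delivered))  # stops at the disk
    print(first_boot_device(["pxe", "disk"], as_delivered))  # reaches PXE

    # A truly bare disk lets the factory order fall through to PXE.
    bare_disk = {"disk": False, "pxe": True}
    print(first_boot_device(["disk", "pxe"], bare_disk))
```

In this model a truly bare disk would make the factory order work as well, which is consistent with the sales rep's remark that a bare disk was an extra-cost option.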
From landman at scalableinformatics.com Fri Jun 6 09:45:27 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <48495A43.4060809@tamu.edu> References: <48495A43.4060809@tamu.edu> Message-ID: <484969A7.4030008@scalableinformatics.com> Hi Gary A first point, before going anywhere else ... you get what you pay for ... most of the time. The vast majority of rack-n-stack vendors do what you describe. They know one thing, any deviation from that leaves them blinking with open mouths ... they deliver what they have a level of comfort delivering. Gerry Creager wrote: > We recently purchased a set of hardware for a cluster from a hardware > vendor. We've encountered a couple of interesting issues with bringing > the thing up that I'd like to get group comments on. Note that the RFP > and negotiations specified this system was for a cluster installation, > so there would be no misunderstanding... > > 1. We specified "No OS" in the purchase so that we could install CentOS > as our base. We got a set of systems with a stub OS, and an EULA for > the diagnostics embedded on the disk. After clicking thru the EULA, it > tells us we have no OS on the disk, but does not fail to PXE. I would say it is likely due to the fact that altering their base cluster construction model is a problem (e.g. costs them money). FWIW: We boot our nodes diskless during testing, and diskful if this is the required state of the cluster. Nothing like actually testing the hardware you are going to deliver in the way your customers are going to use it. This said, it seems ... unlikely ... that this was their purpose. > > 2. BIOS had a couple of interesting defaults, including warn on > keyboard error (Keyboard? Not intentionally. This is a compute node, > and should never require a keyboard. Ever.) We also find the BIOS is > set to boot from hard disk THEN PXE. 
But due to item 1, above, we never > can fail over to PXE unless we load up a keyboard and monitor, and hit > F12 to drop to PXE. Egad. We (by hand) reconfigure the bios specifically so that there are no issues like this. > In discussions with our sales rep, I'm told that we'd have had to pay > extra to get a real bare hard disk, and that, for a fee, they'd have Heh.... if you want nothing what is it you have to pay? :) > been willing to custom-configure the BIOS. OK, with the BIOS this isn't > too unreasonable: They have a standard BIOS for all systems and if you > want something special, paying for it's the norm... But, still, this is > a CLUSTER installation we were quoted, not a desktop. Agreed. > > Also, I'm now told that "almost every customer" ordered their cluster > configuration service at several kilobucks per rack. Since the team I'm This is standard, it costs money to rack and stack. If you don't want it, you don't have to get it. > working with has some degree of experience in configuring and installing > hardware and software on computational clusters, now measured in at > least 10 separate cluster installations, this seemed like an unnecessary > expense. However, we're finding vendor gotchas that are annoying at the Yeah, in this case, it is unnecessary. If your team has the expertise, you don't need to pay for it. > least, and sometimes cause significant work-around time/effort. For the smaller companies that do cluster setup/installs, the idea is not to mess the customer up. > > Finally, our sales guy yesterday was somewhat baffled as to why we'd > ordered without OS, and further why we were using Linux over Windows for > HPC. Not trying to revive the recent rant-fest about Windows HPC You do understand how hard (e.g. how much money is flowing from) Microsoft is pushing their solution. Money talks. 
> capabilities, can anyone cite real HPC applications generally run on > significant clusters (I'll accept Cornell's work, although I remain > personally convinced that the bulk of their Windows HPC work has been > dedicated to maintaining grant funding rather than doing real work)? > > No, I won't identify the vendor. :) -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From john.hearns at streamline-computing.com Fri Jun 6 09:55:05 2008 From: john.hearns at streamline-computing.com (John Hearns) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <48495A43.4060809@tamu.edu> References: <48495A43.4060809@tamu.edu> Message-ID: <1212771315.9679.15.camel@Vigor13> On Fri, 2008-06-06 at 10:39 -0500, Gerry Creager wrote: > Also, I'm now told that "almost every customer" ordered their cluster > configuration service at several kilobucks per rack. Since the team I'm > working with has some degree of experience in configuring and installing > hardware and software on computational clusters, now measured in at > least 10 separate cluster installations, this seemed like an unnecessary > expense. However, we're finding vendor gotchas that are annoying at the > least, and sometimes cause significant work-around time/effort. Somebody has to pay for all those technicians to set your BIOSes. Seriously, we almost always do turn-key clusters for customers. We do what we term "hardware only" deals - but you wouldn't recognise them as such. The project I'm thinking of consisted of supplying many Intel twin servers, and us setting those BIOSes prior to delivery, racking, labelling and cabling all servers and switches.
Providing a loan cluster head node and installing our cluster distribution on there for a week long soak test prior to the customer accepting it and then reinstalling with their own OS. Quite, quite far from leaving a pile of boxes on the loading dock. I don't want to go into this one too deeply, but when we do true hardware-only deals (I'm thinking more of network switches here) you end up supporting things out of goodwill anyway. From bill at cse.ucdavis.edu Fri Jun 6 10:00:29 2008 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <48495A43.4060809@tamu.edu> References: <48495A43.4060809@tamu.edu> Message-ID: <48496D2D.5080801@cse.ucdavis.edu> Gerry Creager wrote: > 1. We specified "No OS" in the purchase so that we could install CentOS > as our base. We got a set of systems with a stub OS, and an EULA for > the diagnostics embedded on the disk. After clicking thru the EULA, it > tells us we have no OS on the disk, but does not fail to PXE. If you want to avoid hooking up a KVM to each node and rebooting it once or twice I'd suggest putting "Nodes must PXE boot by default" in your specifications. > 2. BIOS had a couple of interesting defaults, including warn on > keyboard error (Keyboard? Not intentionally. This is a compute node, > and should never require a keyboard. Ever.) We also find the BIOS is > set to boot from hard disk THEN PXE. But due to item 1, above, we never > can fail over to PXE unless we load up a keyboard and monitor, and hit > F12 to drop to PXE. Very strange standard for a server, let alone a cluster node. > In discussions with our sales rep, I'm told that we'd have had to pay > extra to get a real bare hard disk, and that, for a fee, they'd have > been willing to custom-configure the BIOS. OK, with the BIOS this isn't > too unreasonable: They have a standard BIOS for all systems and if you > want something special, paying for it's the norm... 
But, still, this is > a CLUSTER installation we were quoted, not a desktop. This whole thing sounds strangely like the vendor has already been picked. Certainly changing any default in the pipeline can cost money, even deleting a floppy, cd/dvd etc can cost money if the machine ships to the integration center with it installed. With that said, when someone charges an unreasonable amount for said customizations they lose the bid and someone else wins. > Also, I'm now told that "almost every customer" ordered their cluster > configuration service at several kilobucks per rack. Since the team I'm Not sure of the relevance here. Sounds like the upsell and padding that sales folks love; it is their job to sell equipment, preferably high margin at that. Seems way high for a BIOS reset, less so if it includes a cabling harness for power, console, rails premounted, and network. Again if it's a bid process.... > working with has some degree of experience in configuring and installing > hardware and software on computational clusters, now measured in at > least 10 separate cluster installations, this seemed like an unnecessary > expense. However, we're finding vendor gotchas that are annoying at the > least, and sometimes cause significant work-around time/effort. Well, there are two choices: either deal with the gotchas, or make them part of the specifications. All vendors have their differences, defaults, and cost structures. Do you want a cluster that could conceivably allow users to start submitting jobs within a day? Or do you want to play BIOS games, testing, and integration that might take a week or two? Every time I order a cluster (well over 10 now) I get vendor queries of the "Sounds like X might mean you need Y which costs $Z". I'm always very clear, it's in the spec, and not meeting the spec will mean the bid isn't considered. Definitely seems like some high margin items end up included... without the margin.
> Finally, our sales guy yesterday was somewhat baffled as to why we'd > ordered without OS, and further why we were using Linux over Windows for > HPC. Heh, some sales folks seem to have a right to exert design pressure on cluster design, not sure why you're even entertaining that one. If you want to be particularly friendly I'd just point at top500.org and note that Linux is the standard and not the exception for beowulf clusters. > No, I won't identify the vendor. How about the number of letters in their name ;-). In general I find that the big vendors build in large profits (i.e. negotiating down to 50% of list price is not unusual) and the preferred cluster defaults often mean higher costs instead of less, despite the typically higher volume purchases, identical compute nodes, don't need a dvd, don't need an OS, don't (typically) need a redundant power supply for compute nodes, etc. The smaller cluster specific shops default (usually) to mostly reasonable cluster configurations, and seem to default to smaller margins. In my experience, writing a spec that welcomes both ends up with the best deals. Even something trivial like specifying 14 or 15 disks in an array (often the max for an external array) instead of 16 (common for direct attached) can be the difference that allows a competitive bid from a big vendor. Sometimes Intel or AMD intercedes to get a design win and sometimes a big vendor decides to get more competitive. Of course these specifications directly affect costs and lead to endless discussions on this list. KVM over IP? Serial console? Any console access at all? IPMI or just switched PDUs? But in my experience a requirement like "must boot from PXE" is not a big deal, and not worth several kilobucks. From perry at piermont.com Fri Jun 6 10:45:41 2008 From: perry at piermont.com (Perry E.
Metzger) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <48496D2D.5080801@cse.ucdavis.edu> (Bill Broadley's message of "Fri\, 06 Jun 2008 10\:00\:29 -0700") References: <48495A43.4060809@tamu.edu> <48496D2D.5080801@cse.ucdavis.edu> Message-ID: <87hcc6h2yi.fsf@snark.cb.piermont.com> Bill Broadley writes: >> 2. BIOS had a couple of interesting defaults, including warn on >> keyboard error (Keyboard? Not intentionally. This is a compute >> node, and should never require a keyboard. Ever.) We also find the >> BIOS is set to boot from hard disk THEN PXE. But due to item 1, >> above, we never can fail over to PXE unless we load up a keyboard >> and monitor, and hit F12 to drop to PXE. > > Very strange standard for a server, let alone a cluster node. I would be less disturbed about such things if it was trivial to alter the BIOS settings in a semi-automated way -- say by booting some standalone program, or loading a file from a USB thumb drive. Then you could just go up to each box with a USB thumb drive, turn it on, and have it fix itself in a consistent way. However, the fact that you can't generally automate fixing BIOS settings makes all of this far more annoying. Anyone have any cool tricks for how to consistently set the BIOS on large numbers of boxes without requiring steps that humans can screw up easily? -- Perry E. Metzger perry@piermont.com From tjrc at sanger.ac.uk Fri Jun 6 11:35:25 2008 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <87hcc6h2yi.fsf@snark.cb.piermont.com> References: <48495A43.4060809@tamu.edu> <48496D2D.5080801@cse.ucdavis.edu> <87hcc6h2yi.fsf@snark.cb.piermont.com> Message-ID: <9FD5E5B6-BCAF-4C00-A02B-A3678E46C77D@sanger.ac.uk> On 6 Jun 2008, at 6:45 pm, Perry E. Metzger wrote: > > Bill Broadley writes: >>> 2. BIOS had a couple of interesting defaults, including warn on >>> keyboard error (Keyboard? 
Not intentionally. This is a compute >>> node, and should never require a keyboard. Ever.) We also find the >>> BIOS is set to boot from hard disk THEN PXE. But due to item 1, >>> above, we never can fail over to PXE unless we load up a keyboard >>> and monitor, and hit F12 to drop to PXE. >> >> Very strange standard for a server, let alone a cluster node. > > I would be less disturbed about such things if it was trivial to alter > the BIOS settings in a semi-automated way -- say by booting some > standalone program, or loading a file from a USB thumb drive. Then you > could just go up to each box with a USB thumb drive, turn it on, and > have it fix itself in a consistent way. However, the fact that you > can't generally automate fixing BIOS settings makes all of this far > more annoying. > > Anyone have any cool tricks for how to consistently set the BIOS on > large numbers of boxes without requiring steps that humans can screw > up easily? Nope. :-) This is, in my view, one of the major disadvantages of PC clusters. The crappy old BIOS that we're stuck with. Here, we mostly get around this problem by using blade servers rather than pizza boxes. Or at least using pizza boxes which have some form of command line access to a lights-out management processor that allows us to set the boot order, such as those on HP ProLiants and Sun X**** servers. So with c-Class blades from HP, for example, I don't really have a problem - once the chassis is configured, I make them all PXE boot by ssh'ing into the Onboard administrator and typing: set server boot first pxe all poweron server all Bingo, all 16 machines PXE boot at about 1 second intervals. Job's a good'un. As Joe says, you get what you pay for. I don't think I've *ever* had to futz around with BIOS settings on any recent bladeserver (I used to have to on our old RLX bladeservers, which periodically got confused and lost all the CMOS settings, which required manual fixing in the BIOS). 
But the IBM and HP stuff we use now, it's very rare indeed. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From matt at technoronin.com Fri Jun 6 15:55:44 2008 From: matt at technoronin.com (Matt Lawrence) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <48495A43.4060809@tamu.edu> References: <48495A43.4060809@tamu.edu> Message-ID: On Fri, 6 Jun 2008, Gerry Creager wrote: > We recently purchased a set of hardware for a cluster from a hardware vendor. > We've encountered a couple of interesting issues with bringing the thing up > that I'd like to get group comments on. Note that the RFP and negotiations > specified this system was for a cluster installation, so there would be no > misunderstanding... > > 1. We specified "No OS" in the purchase so that we could install CentOS as > our base. We got a set of systems with a stub OS, and an EULA for the > diagnostics embedded on the disk. After clicking thru the EULA, it tells us > we have no OS on the disk, but does not fail to PXE. And standing on the perforated floor tiles in front of the system performing this task all day long has left me with sore and very cold feet. The good news is that the cooling in there is working much better. Our student worker is on vacation sailing in the Caribbean, so I get the abuse.... -- Matt It's not what I know that counts. It's what I can remember in time to use.
From gerry.creager at tamu.edu Fri Jun 6 17:34:53 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <9FD5E5B6-BCAF-4C00-A02B-A3678E46C77D@sanger.ac.uk> References: <48495A43.4060809@tamu.edu> <48496D2D.5080801@cse.ucdavis.edu> <87hcc6h2yi.fsf@snark.cb.piermont.com> <9FD5E5B6-BCAF-4C00-A02B-A3678E46C77D@sanger.ac.uk> Message-ID: <4849D7AD.30605@tamu.edu> Tim Cutts wrote: > > On 6 Jun 2008, at 6:45 pm, Perry E. Metzger wrote: > >> >> Bill Broadley writes: >>>> 2. BIOS had a couple of interesting defaults, including warn on >>>> keyboard error (Keyboard? Not intentionally. This is a compute >>>> node, and should never require a keyboard. Ever.) We also find the >>>> BIOS is set to boot from hard disk THEN PXE. But due to item 1, >>>> above, we never can fail over to PXE unless we load up a keyboard >>>> and monitor, and hit F12 to drop to PXE. >>> >>> Very strange standard for a server, let alone a cluster node. >> >> I would be less disturbed about such things if it was trivial to alter >> the BIOS settings in a semi-automated way -- say by booting some >> standalone program, or loading a file from a USB thumb drive. Then you >> could just go up to each box with a USB thumb drive, turn it on, and >> have it fix itself in a consistent way. However, the fact that you >> can't generally automate fixing BIOS settings makes all of this far >> more annoying. >> >> Anyone have any cool tricks for how to consistently set the BIOS on >> large numbers of boxes without requiring steps that humans can screw >> up easily? > > Nope. :-) This is, in my view, one of the major disadvantages of PC > clusters. The crappy old BIOS that we're stuck with. > > Here, we mostly get around this problem by using blade servers rather > than pizza boxes. 
Or at least using pizza boxes which have some form of > command line access to a lights-out management processor that allows us > to set the boot order, such as those on HP ProLiants and Sun X**** servers. > > So with c-Class blades from HP, for example, I don't really have a > problem - once the chassis is configured, I make them all PXE boot by > ssh'ing into the Onboard administrator and typing: > > set server boot first pxe all > poweron server all > > Bingo, all 16 machines PXE boot at about 1 second intervals. Job's a > good'un. As Joe says, you get what you pay for. I don't think I've > *ever* had to futz around with BIOS settings on any recent bladeserver > (I used to have to on our old RLX bladeservers, which periodically got > confused and lost all the CMOS settings, which required manual fixing in > the BIOS). But the IBM and HP stuff we use now, it's very rare indeed. Yeah.... Part of the problem. The last several clusters I've worked on, we didn't have to futz with the BIOS, either. HOWEVER, it's been pointed out to me that "You get what you pay for" and part of what you pay for is the competent folks making sure such futzing isn't required. gerry -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From tjrc at sanger.ac.uk Sat Jun 7 00:14:37 2008 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <4849D7AD.30605@tamu.edu> References: <48495A43.4060809@tamu.edu> <48496D2D.5080801@cse.ucdavis.edu> <87hcc6h2yi.fsf@snark.cb.piermont.com> <9FD5E5B6-BCAF-4C00-A02B-A3678E46C77D@sanger.ac.uk> <4849D7AD.30605@tamu.edu> Message-ID: <70D27FFA-DAD1-4493-9D51-72E149FCA5EF@sanger.ac.uk> On 7 Jun 2008, at 1:34 am, Gerry Creager wrote: > Yeah.... Part of the problem. 
The last several clusters I've worked > on, we didn't have to futz with the BIOS, either. HOWEVER, it's > been pointed out to me that "You get what you pay for" and part of > what you pay for is the competent folks making sure such futzing > isn't required. Well, there is that. Or at least, paying for people to do the dreary futzing for you. Tim From gerry.creager at tamu.edu Sat Jun 7 07:49:42 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <70D27FFA-DAD1-4493-9D51-72E149FCA5EF@sanger.ac.uk> References: <48495A43.4060809@tamu.edu> <48496D2D.5080801@cse.ucdavis.edu> <87hcc6h2yi.fsf@snark.cb.piermont.com> <9FD5E5B6-BCAF-4C00-A02B-A3678E46C77D@sanger.ac.uk> <4849D7AD.30605@tamu.edu> <70D27FFA-DAD1-4493-9D51-72E149FCA5EF@sanger.ac.uk> Message-ID: <484AA006.9070009@tamu.edu> And done here. Tim Cutts wrote: > > On 7 Jun 2008, at 1:34 am, Gerry Creager wrote: > >> Yeah.... Part of the problem. The last several clusters I've worked >> on, we didn't have to futz with the BIOS, either. HOWEVER, it's been >> pointed out to me that "You get what you pay for" and part of what you >> pay for is the competent folks making sure such futzing isn't required. > > Well, there is that. Or at least, paying for people to do the dreary > futzing for you.
> > Tim > > -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From gerry.creager at tamu.edu Sat Jun 7 07:50:44 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <484AA006.9070009@tamu.edu> References: <48495A43.4060809@tamu.edu> <48496D2D.5080801@cse.ucdavis.edu> <87hcc6h2yi.fsf@snark.cb.piermont.com> <9FD5E5B6-BCAF-4C00-A02B-A3678E46C77D@sanger.ac.uk> <4849D7AD.30605@tamu.edu> <70D27FFA-DAD1-4493-9D51-72E149FCA5EF@sanger.ac.uk> <484AA006.9070009@tamu.edu> Message-ID: <484AA044.1040107@tamu.edu> Sorry about that. Wrong message when I hit "reply all". Time for more coffee. gerry Gerry Creager wrote: > And done here. > > Tim Cutts wrote: >> >> On 7 Jun 2008, at 1:34 am, Gerry Creager wrote: >> >>> Yeah.... Part of the problem. The last several clusters I've worked >>> on, we didn't have to futz with the BIOS, either. HOWEVER, it's been >>> pointed out to me that "You get what you pay for" and part of what >>> you pay for is the competent folks making sure such futzing isn't >>> required. >> >> Well, there is that. Or at least, paying for people to do the dreary >> futzing for you. >> >> Tim >> >> > -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From csamuel at vpac.org Sun Jun 8 17:09:15 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <87hcc6h2yi.fsf@snark.cb.piermont.com> Message-ID: <1110348284.110351212970155947.JavaMail.root@zimbra.vpac.org> ----- "Perry E. 
Metzger" wrote: > I would be less disturbed about such things if it was > trivial to alter the BIOS settings in a semi-automated > way -- say by booting some standalone program, or loading > a file from a USB thumb drive. Our most recent vendor went to the motherboard manufacturer and said "please can you cut us a BIOS with these default settings" and they did so. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From tjrc at sanger.ac.uk Sun Jun 8 21:05:02 2008 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <1110348284.110351212970155947.JavaMail.root@zimbra.vpac.org> References: <1110348284.110351212970155947.JavaMail.root@zimbra.vpac.org> Message-ID: <69E48321-05BE-4DA1-B5CC-A92A5DF24F56@sanger.ac.uk> On 9 Jun 2008, at 1:09 am, Chris Samuel wrote: > > ----- "Perry E. Metzger" wrote: > >> I would be less disturbed about such things if it was >> trivial to alter the BIOS settings in a semi-automated >> way -- say by booting some standalone program, or loading >> a file from a USB thumb drive. > > Our most recent vendor went to the motherboard manufacturer > and said "please can you cut us a BIOS with these default > settings" and they did so. If you don't mind us asking, roughly how much extra did *that* cost? Tim
From csamuel at vpac.org Mon Jun 9 00:28:46 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <1237440751.110541212996492356.JavaMail.root@zimbra.vpac.org> Message-ID: <992702936.110561212996526583.JavaMail.root@zimbra.vpac.org> ----- "Tim Cutts" wrote: > On 9 Jun 2008, at 1:09 am, Chris Samuel wrote: > > > Our most recent vendor went to the motherboard manufacturer > > and said "please can you cut us a BIOS with these default > > settings" and they did so. > > If you don't mind us asking, roughly how much extra did *that* cost? Nothing. -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From tjrc at sanger.ac.uk Mon Jun 9 01:43:51 2008 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <992702936.110561212996526583.JavaMail.root@zimbra.vpac.org> References: <992702936.110561212996526583.JavaMail.root@zimbra.vpac.org> Message-ID: <484CED47.5050804@sanger.ac.uk> Chris Samuel wrote: > ----- "Tim Cutts" wrote: > > >> On 9 Jun 2008, at 1:09 am, Chris Samuel wrote: >> >> >>> Our most recent vendor went to the motherboard manufacturer >>> and said "please can you cut us a BIOS with these default >>> settings" and they did so. >>> >> If you don't mind us asking, roughly how much extra did *that* cost? >> > > Nothing. > Wow. How many nodes were you buying? And are we allowed to know who the vendor was? Tim
From csamuel at vpac.org Mon Jun 9 03:30:44 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <222401034.110641213007329196.JavaMail.root@zimbra.vpac.org> Message-ID: <767212096.110661213007444346.JavaMail.root@zimbra.vpac.org> ----- "Tim Cutts" wrote: > Wow. How many nodes were you buying? 95 nodes, each with two Barcelonas, so 760 cores all up. 32GB RAM (4GB/core) and 4x300GB SATA drives (RAID-0) per node. > And are we allowed to know who the vendor was? It's all public, so no reason why not. It was a local Melbourne mob called Xenon Systems, they sell SuperMicro based systems. Kudos to both of them for their support. cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From apittman at concurrent-thinking.com Mon Jun 9 03:49:07 2008 From: apittman at concurrent-thinking.com (Ashley Pittman) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <48495A43.4060809@tamu.edu> References: <48495A43.4060809@tamu.edu> Message-ID: <1213008547.8064.9.camel@bruce.priv.wark.uk.streamline-computing.com> On Fri, 2008-06-06 at 10:39 -0500, Gerry Creager wrote: > > 2. BIOS had a couple of interesting defaults, including warn on > keyboard error (Keyboard? Not intentionally. This is a compute > node, > and should never require a keyboard. Ever.) We also find the BIOS > is > set to boot from hard disk THEN PXE. But due to item 1, above, we > never > can fail over to PXE unless we load up a keyboard and monitor, and > hit > F12 to drop to PXE. I can think of at least one cluster where the opposite has been true and PXE boot has been the default. 
The problem with this is if the head node PXE boots on the customer's network and gets automatically re-installed as a Windows workstation everybody gets egg on their face. Yes even "modern" BIOSes are bad but localboot first is a sensible default. Ashley Pittman. From matt at technoronin.com Mon Jun 9 06:11:53 2008 From: matt at technoronin.com (Matt Lawrence) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <1213008547.8064.9.camel@bruce.priv.wark.uk.streamline-computing.com> References: <48495A43.4060809@tamu.edu> <1213008547.8064.9.camel@bruce.priv.wark.uk.streamline-computing.com> Message-ID: On Mon, 9 Jun 2008, Ashley Pittman wrote: > I can think of at least one cluster where the opposite has been true and > PXE boot has been the default. The problem with this is if the head > node PXE boots on the customer's network and gets automatically > re-installed as a Windows workstation everybody gets egg on their face. > Yes even "modern" BIOSes are bad but localboot first is a sensible > default. I will have to disagree. Changing the BIOS settings in a single head node is preferable to having to connect to 126 compute nodes and change their BIOS settings. -- Matt It's not what I know that counts. It's what I can remember in time to use. From prentice at ias.edu Mon Jun 9 08:41:29 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] User resource limits Message-ID: <484D4F29.9090704@ias.edu> This topic is slightly off topic, since it's not a beowulf specific problem, but it is HPC-related: I have several fat servers with 4 cores and 32 GB of RAM, for jobs that aren't very parallel and need large amounts of RAM. They are not clustered in any way. At the moment, users ssh into these systems to run large jobs. Eventually, I will have these nodes managed by a queuing system. The problem: Every couple of days, one of these systems becomes unresponsive due to OOM errors.
If we wait long enough, the offending job will complete, and everything will return to normal. Since these are multi-user shared resources, I don't have the luxury of waiting for the systems to clear themselves up, and I often have to hit the power button. I would like to impose some CPU and memory limits on users that are hard limits that can't be changed/overridden by the users. What is the best way to do this? All I know is environment variables or shell commands done as the user (ulimit, for example). -- Prentice From dnlombar at ichips.intel.com Mon Jun 9 09:12:32 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Thu Aug 28 01:07:08 2008 Subject: [Beowulf] User resource limits In-Reply-To: <484D4F29.9090704@ias.edu> References: <484D4F29.9090704@ias.edu> Message-ID: <20080609161232.GB11155@nlxdcldnl2.cl.intel.com> On Mon, Jun 09, 2008 at 11:41:29AM -0400, Prentice Bisbal wrote: > > I would like to impose some CPU and memory limits on users that are hard > limits that can't be changed/overridden by the users. What is the best > way to do this? All I know is environment variables or shell commands > done as the user (ulimit, for example). pam_limits and /etc/security/limits.conf -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From perry at piermont.com Mon Jun 9 09:53:41 2008 From: perry at piermont.com (Perry E. Metzger) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] User resource limits In-Reply-To: <484D4F29.9090704@ias.edu> (Prentice Bisbal's message of "Mon\, 09 Jun 2008 11\:41\:29 -0400") References: <484D4F29.9090704@ias.edu> Message-ID: <87od6av9be.fsf@snark.cb.piermont.com> Prentice Bisbal writes: > I would like to impose some CPU and memory limits on users that are hard > limits that can't be changed/overridden by the users. What is the best > way to do this? All I know is environment variables or shell commands > done as the user (ulimit, for example). 
ulimit is not quite "a command done by the user". The user manipulates their ulimits with the shell ulimit command, but the limits are in fact maintained by the kernel, and can be set by the administrator at maximum levels that the user cannot raise. Read the man page for getrlimit/setrlimit for details on this. ulimits are inherited by a process from its parents, so if the process used at login (like sshd) sets them appropriately, the limits are inherited by the whole session. The administrator can set default ulimit ceilings in various login configuration files -- the file that is used depends on the specific OS you are running. If you say what OS and/or distro you are using, I can be of more specific help. Perry From perry at piermont.com Mon Jun 9 09:55:51 2008 From: perry at piermont.com (Perry E. Metzger) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] User resource limits In-Reply-To: <20080609161232.GB11155@nlxdcldnl2.cl.intel.com> (David N. Lombard's message of "Mon\, 9 Jun 2008 09\:12\:32 -0700") References: <484D4F29.9090704@ias.edu> <20080609161232.GB11155@nlxdcldnl2.cl.intel.com> Message-ID: <87k5gyv97s.fsf@snark.cb.piermont.com> "Lombard, David N" writes: > On Mon, Jun 09, 2008 at 11:41:29AM -0400, Prentice Bisbal wrote: >> >> I would like to impose some CPU and memory limits on users that are hard >> limits that can't be changed/overridden by the users. What is the best >> way to do this? All I know is environment variables or shell commands >> done as the user (ulimit, for example). > > pam_limits and /etc/security/limits.conf You're making assumptions about what OS he's running. He didn't say which flavor of Unix this is. We can only assume it is some POSIX OS because he mentions the ulimit command. Indeed, not even all Linuxes use that file, though many do.
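The soft/hard limit inheritance described above can be sketched in a few lines. This is only an illustration, assuming Linux and Python 3.7 or later; `run_limited` and `_cap` are hypothetical helper names, not part of any tool mentioned in this thread. It sets both the soft and the hard limit in a child process before exec, which is roughly what a login process would do on behalf of the session.

```python
import resource
import subprocess
import sys


def _cap(which, value):
    """Set soft and hard limit for `which` to the finite number `value`,
    never going above the hard limit already in force."""
    _soft, hard = resource.getrlimit(which)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)  # an unprivileged process cannot raise its hard limit
    resource.setrlimit(which, (value, value))


def run_limited(argv, limits):
    """Run argv in a child process with the given {RLIMIT_*: value} ceilings.

    Hypothetical helper: the limits are applied in the child only and are
    inherited by everything the child starts.
    """
    def apply_limits():
        # Called in the child after fork() and before exec();
        # the parent keeps its own limits.
        for which, value in limits.items():
            _cap(which, value)

    return subprocess.run(argv, preexec_fn=apply_limits,
                          capture_output=True, text=True)


if __name__ == "__main__":
    child = ("import resource; "
             "print(*resource.getrlimit(resource.RLIMIT_CPU))")
    result = run_limited([sys.executable, "-c", child],
                         {resource.RLIMIT_CPU: 60})  # 60 CPU-seconds
    print("child sees soft/hard CPU limit:", result.stdout.strip())
```

On a system that uses pam_limits, the administrator-side equivalent would presumably be `hard` entries (for example the `cpu` or `as` items) in /etc/security/limits.conf, applied at login rather than by a wrapper like this one.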
Perry From prentice at ias.edu Mon Jun 9 10:38:08 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] User resource limits In-Reply-To: <87k5gyv97s.fsf@snark.cb.piermont.com> References: <484D4F29.9090704@ias.edu> <20080609161232.GB11155@nlxdcldnl2.cl.intel.com> <87k5gyv97s.fsf@snark.cb.piermont.com> Message-ID: <484D6A80.20803@ias.edu> Perry E. Metzger wrote: > "Lombard, David N" writes: >> On Mon, Jun 09, 2008 at 11:41:29AM -0400, Prentice Bisbal wrote: >>> I would like to impose some CPU and memory limits on users that are hard >>> limits that can't be changed/overridden by the users. What is the best >>> way to do this? All I know is environment variables or shell commands >>> done as the user (ulimit, for example). >> pam_limits and /etc/security/limits.conf > > You're making assumptions about what OS he's running. He didn't say > which flavor of Unix this is. We can only assume it is some POSIX OS > because he mentions the ulimit command. Indeed, not even all Linuxes > use that file, though many do. > Yeah, my mistake - I forgot to include that important piece of data. My apologies. I'm running PU_IAS Linux 5.1. PU_IAS is a rebuild of RHEL, so anything that applies to RHEL applies to PU_IAS. http://plug.princeton.edu/linux/ I think David was assuming I was running Linux, and he was correct. Thanks for your help. I have to go read some man pages now. Prentice From kus at free.net Mon Jun 9 15:01:40 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] size of swap partition Message-ID: A long time ago a simple rule was formulated for swap partition size (equal to main memory size). Currently we all have relatively large RAM on the nodes (typically, I believe, it is 2 or more GB per core; we have 16 GB per dual-socket quad-core Opteron node). What is the typical swap size today?
I understand that it depends from applications ;-) We, in particular, practically don't have jobs which run "out-of-RAM". For single core dual-socket Opteron nodes w/4GB RAM per node and "molecular modelling workload" we used 4 GB swap partition. But what are the reccomendations of modern praxis ? Mikhail Kuzminksy Computer Assistance to Chemical Research Center Zelinsky Inst. of Organic Chemistry Moscow From gerry.creager at tamu.edu Mon Jun 9 15:51:34 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] size of swap partition In-Reply-To: References: Message-ID: <484DB3F6.7010904@tamu.edu> Misha, We have the potential to have to swap whole jobs out of memory on a complete node. As a result, I recommend 1.5-2.0 times memory in swap if this is a consideration. I do know there's likely to be a bit of discussion as this varies widely from site to site and based on requirements. gerry Mikhail Kuzminsky wrote: > A lot of time ago it was formulated simple rule for swap partition size > (equal to main memory size). > > Currently we all have relative large RAM on the nodes (typically, I > beleive, it is 2 or more GB per core; we have 16 GB per dual-socket > quad-core Opteron node). What is typical modern swap size today? > > I understand that it depends from applications ;-) We, in particular, > practically don't have jobs which run "out-of-RAM". For single core > dual-socket Opteron nodes w/4GB RAM per node and "molecular modelling > workload" we used 4 GB swap partition. > > But what are the reccomendations of modern praxis ? > > Mikhail Kuzminksy > Computer Assistance to Chemical Research Center > Zelinsky Inst. 
of Organic Chemistry > Moscow _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From kyron at neuralbs.com Mon Jun 9 18:28:28 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] size of swap partition In-Reply-To: <484DB3F6.7010904@tamu.edu> References: <484DB3F6.7010904@tamu.edu> Message-ID: <484DD8BC.10208@neuralbs.com> Mikhail, Somewhat like Gerry said, ballpark figures have always been an arbitrary 1.5*RAM. This is completely ridiculous nowadays and should depend entirely on the applications you run. Typically, you should never swap out memory on a running application. I recommend you perform some metrics collection, doesn't have to be perfect and super-fine-grained. Something like Ganglia should be sufficient to give you an idea of how much swap you need, if ever you actually hit it...but don't! Eric PS: this is a redundant topic on the list ...do a little searching and you'll hit it ;) Gerry Creager wrote: > Misha, > > We have the potential to have to swap whole jobs out of memory on a > complete node. As a result, I recommend 1.5-2.0 times memory in swap > if this is a consideration. I do know there's likely to be a bit of > discussion as this varies widely from site to site and based on > requirements. > > gerry > > Mikhail Kuzminsky wrote: >> A lot of time ago it was formulated simple rule for swap partition size >> (equal to main memory size). >> >> Currently we all have relative large RAM on the nodes (typically, I >> beleive, it is 2 or more GB per core; we have 16 GB per dual-socket >> quad-core Opteron node). 
What is typical modern swap size today? >> >> I understand that it depends from applications ;-) We, in particular, >> practically don't have jobs which run "out-of-RAM". For single core >> dual-socket Opteron nodes w/4GB RAM per node and "molecular modelling >> workload" we used 4 GB swap partition. >> >> But what are the reccomendations of modern praxis ? >> >> Mikhail Kuzminksy >> Computer Assistance to Chemical Research Center >> Zelinsky Inst. of Organic Chemistry >> Moscow _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > From csamuel at vpac.org Mon Jun 9 20:56:28 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] User resource limits In-Reply-To: <565153331.116881213069858428.JavaMail.root@zimbra.vpac.org> Message-ID: <742070125.116971213070188917.JavaMail.root@zimbra.vpac.org> ----- "Prentice Bisbal" wrote: > I think David was assuming I was running Linux, and he was correct. > thanks for your help. I have to go read some man pages now. Be very aware that there are two different ulimits that affect memory allocations, *depending on the size of the allocation that is asked for*, if you have glibc 2.3 or newer (so most distros still in use). For allocations < 128KB the standard memory limit is applied as it uses brk(), but for allocations greater than that it uses mmap(). Unfortunately the kernel implementation of mmap() doesn't check the maximum memory size (RLIMIT_RSS) or maximum data size (RLIMIT_DATA) limits which were being set, but only the maximum virtual RAM size (RLIMIT_AS) - this is documented in the setrlimit(2) man page. :-( -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From hahn at mcmaster.ca Mon Jun 9 21:58:12 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] size of swap partition In-Reply-To: <484DB3F6.7010904@tamu.edu> References: <484DB3F6.7010904@tamu.edu> Message-ID: > We have the potential to have to swap whole jobs out of memory on a complete > node. that was our intent as well. among other things, this scheme enables running the cluster "split-personality" - mostly shorter/smaller even interactive jobs during the day, with big/long jobs running at night. unfortunately, you need a smart scheduler to do this, and ours is dumb. >> beleive, it is 2 or more GB per core; we have 16 GB per dual-socket >> quad-core Opteron node). What is typical modern swap size today? are you willing to use a node which is actually occupying 16 GB of swap? it is possible to tune how the kernel responds to memory crunches - for instance, you can always avoid OOM with the vm.overcommit_memory=2 sysctl (you'll need to tune vm.overcommit_ratio and the amount of swap to get the desired limits.) in this mode, the kernel tracks how much VM it actually needs (worst-case, reflected in Committed_AS in /proc/meminfo) and compares that to a commit limit that reflects ram and swap. if you don't use overcommit_memory=2, you are basically borrowing VM space in hopes of not needing it. that can still be reasonable, considering how often processes have a lot of shared VM, and how many processes allocate but never touch lots of pages. but you have to ask yourself: would I like a system that was actually _using_ 16 GB of swap? if you have 16x disks, perhaps, but 16G will suck if you only have 1 disk. at least for overcommit_memory != 2, I don't see the point of configuring a lot of swap, since the only time you'd use it is if you were thrashing. sort of a "quality of life" argument. 
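
As a rough illustration of how the commit limit relates to RAM, swap and vm.overcommit_ratio in overcommit_memory=2 mode, here is a small sketch. It assumes the usual Linux formula (swap plus RAM times the ratio, ignoring huge pages); the function name is hypothetical, and the 16 GB RAM / 4 GB swap figures are only example inputs.

```shell
#!/bin/sh
# Sketch only: estimate the commit limit applied when vm.overcommit_memory=2.
# Assumed formula (huge pages ignored):
#   CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
# All sizes are in kB, the unit used by /proc/meminfo.
# commit_limit_kb is a hypothetical helper name.
commit_limit_kb() {
    ram_kb=$1
    swap_kb=$2
    ratio=$3
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# Example inputs: 16 GB of RAM, 4 GB of swap, and the kernel's default
# overcommit_ratio of 50.
commit_limit_kb 16777216 4194304 50    # prints 12582912 (12 GB)

# On a live system, the kernel's own figures can be checked with:
#   grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo
#   sysctl vm.overcommit_memory vm.overcommit_ratio
```

With the default ratio, a 16 GB node with 4 GB of swap would refuse allocations beyond about 12 GB of committed address space, which is less than its RAM. That is why the ratio and the swap size have to be tuned together in this mode.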
>> But what are the reccomendations of modern praxis ? it depends a lot on the size variance of your jobs, as well as their real/virtual ratio. the kernel only enforces RLIMIT_AS (vsz in ps),assuming a 2.6 kernel - I forget whether 2.4 did RLIMIT_RSS or not. if you use overcommit_memory=2, your desired max VM size determines the amount of swap. otherwise, go with something modest - memory size or so. but given that the smallest reasonable single disk these days is probably about 320GB, it's hard to justify being _too_ tight. From jclinton at advancedclustering.com Thu Jun 5 08:38:42 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] OFED/IB for FC8 In-Reply-To: <4847B410.1070202@lfbs.rwth-aachen.de> References: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com> <20080605001549.GE27430@bx9.net> <4847B410.1070202@lfbs.rwth-aachen.de> Message-ID: <588c11220806050838t5d2ede3fh3e2e3880869ba4df@mail.gmail.com> On Thu, Jun 5, 2008 at 4:38 AM, Rainer Finocchiaro < rainer@lfbs.rwth-aachen.de> wrote: > Hi Michael, > > Greg Lindahl schrieb: > >> All the OFED rpm's for FC6 installed on FC8 without difficulty, except for >>> opensm-3.0.3-0.ppc64.rpm >>> >> >> This is the cause of most of your subsequent problems. Without an SM >> running somewhere on your network, the links don't come fully up. >> ... > > > ... > Following your link, I reach a download directory offering only ppc64-RPMs; > in fact all precompiled RPMs for OFED-1.2.5 are for Power PC and not for > x86. > > .. > Much better is to download more up-to-date OFED-1.3 sources. The package > includes an install script, which builds and installs the RPMs for you. So > you don't have to "fear" to install something which is not controlled by > your package management system (RPM). As a side note, you've probably gotten yourself in to an unrecoverable state with RPM having already installed all those PPC RPM's on your Fedora 8 x86_64 systems. 
The easiest thing to do is probably reinstall but if you want, you can try removing them all with something like this: cd /path/to/downloaded/RPMS ls | grep -oP .+?\(\?=.x86_64\\.rpm\) | xargs rpm -e The command will extract the names of the RPM's known by RPM that you installed and then ask RPM to remove them. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20080605/7bd7a474/attachment.html From jclinton at advancedclustering.com Thu Jun 5 08:41:14 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] OFED/IB for FC8 In-Reply-To: <588c11220806050838t5d2ede3fh3e2e3880869ba4df@mail.gmail.com> References: <6.2.5.6.2.20080604150239.047ad270@NumerEx-LLC.com> <20080605001549.GE27430@bx9.net> <4847B410.1070202@lfbs.rwth-aachen.de> <588c11220806050838t5d2ede3fh3e2e3880869ba4df@mail.gmail.com> Message-ID: <588c11220806050841r743b0246rea9f940e5c8c7753@mail.gmail.com> On Thu, Jun 5, 2008 at 10:38 AM, Jason Clinton < jclinton@advancedclustering.com> wrote: > ls | grep -oP .+?\(\?=.x86_64\\.rpm\) | xargs rpm -e > Of course, replace "x86_64" with "ppc64" if indeed that is what you installed. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20080605/e3be0b3d/attachment.html From jclinton at advancedclustering.com Thu Jun 5 10:46:54 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: References: Message-ID: <588c11220806051046t3ad48a02q449a3ede0e299884@mail.gmail.com> On Thu, Jun 5, 2008 at 11:39 AM, Mikhail Kuzminsky wrote: > In message from Mark Hahn (Thu, 5 Jun 2008 11:57:28 > -0400 (EDT)): > >> To be more exact, Rev. B2 of Opteron 2350 - is it for CPU stepping w/error >>> or w/o error ? 
>>> >> >> AMD, like Intel, does a reasonable job of disclosing such info: >> >> >> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF >> >> the well-known problem is erattum 298, I think, and fixed in B3. >> > > Yes, this AMD errata document says that in B3 revision the error "will be > fixed". I heard that new CPUs w/o TLB+L3 error are shipped now, > but are this CPUs really B3 or may be have some more new release ? Yes, what are currently shipping from AMD are B3 revision processors. The TLB-look-aside problem is fixed. There are other less-critical problems with B3, however. Specifically, power-related compatibility issues with various motherboards due to (according to the motherboard manufacturers) AMD changing the TDP late in the release process. I can't give any specific names or models that we know have problems, however. I can say that everyone involved is working on a resolution--usually through PCB revisions of the motherboards. A number of 1U power supplies that have previously worked with all Intel and AMD solutions are now insufficient, as well, due to 12V limitations. B3 pulls a *lot* of power. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20080605/fe201ac3/attachment.html From jclinton at advancedclustering.com Thu Jun 5 11:16:33 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: References: Message-ID: <588c11220806051116i37ff7aa1oec16a85a24009592@mail.gmail.com> On Thu, Jun 5, 2008 at 1:09 PM, Mikhail Kuzminsky wrote: > In message from Mark Hahn (Thu, 5 Jun 2008 13:55:01 > -0400 (EDT)): > >> I'm mystified by this: B2 was broken, so using it without the bios >> workaround is just a mistake or masochism. the workaround _did_ apparently >> have performance implications, but that's why B3 exists... 
>> >> do you mean you know of G03 problems on B2 systems which are operating >> _with_ the workaround? >> > > I don't know exactly, but I think the crash was under absence of > workaround, because I was not informed that there was some kernel patches or > BIOS changes. This was interesting for me also, because I have no > information how this hardware problem may be affected in the "real life". > Mikhail > The B2 BIOS work-around is to disable the L3 cache which gives you a 10-20% performance hit with no reduction in power consumption. The kernel patch is very extensive and, last I heard, under NDA. AMD has said publicly that the patch gives you a 1-2% performance hit. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20080605/34e021ab/attachment.html From malallen at indiana.edu Fri Jun 6 11:10:43 2008 From: malallen at indiana.edu (Matt Allen) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <87hcc6h2yi.fsf@snark.cb.piermont.com> References: <48495A43.4060809@tamu.edu> <48496D2D.5080801@cse.ucdavis.edu> <87hcc6h2yi.fsf@snark.cb.piermont.com> Message-ID: <17D5930B-CC49-49CF-A33C-6E7B01401FC1@indiana.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > cool tricks to consistently set the BIOS We had a cluster of systems that supported configuring the BIOS from an image on a bootable floppy. I bought 96 3.5" floppy disks, put one in each node, and then used parallel scp to dd the desired image to each node's floppy from an NFS mount. Then I power-cycled them simultaneously, and listened to the sound of 96 floppy disks being read at the same time (more or less). I'm sure I'll never hear that sound again in my life. 
I'm not sure how relevant or cool that was (and it did take a few minutes to eject all those disks afterwards), but it took less time than rebooting each node, for sure, and I had a desk full of spare floppy disks for two or three years after that. Matt - -- 812.855.7318 voice Research Technologies - High-Performance Systems hps-admin@iu.edu - http://rtinfo.uits.indiana.edu/hps/ On Jun 6, 2008, at 1:45 PM, Perry E. Metzger wrote: > > Bill Broadley writes: >>> 2. BIOS had a couple of interesting defaults, including warn on >>> keyboard error (Keyboard? Not intentionally. This is a compute >>> node, and should never require a keyboard. Ever.) We also find the >>> BIOS is set to boot from hard disk THEN PXE. But due to item 1, >>> above, we never can fail over to PXE unless we load up a keyboard >>> and monitor, and hit F12 to drop to PXE. >> >> Very strange standard for a server, let alone a cluster node. > > I would be less disturbed about such things if it was trivial to alter > the BIOS settings in a semi-automated way -- say by booting some > standalone program, or loading a file from a USB thumb drive. Then you > could just go up to each box with a USB thumb drive, turn it on, and > have it fix itself in a consistent way. However, the fact that you > can't generally automate fixing BIOS settings makes all of this far > more annoying. > > Anyone have any cool tricks for how to consistently set the BIOS on > large numbers of boxes without requiring steps that humans can screw > up easily? > > -- > Perry E. 
Metzger perry@piermont.com > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (Darwin) iEYEARECAAYFAkhJfaMACgkQsHrhTcWK+IZ2GwCeOYae5FD3OrApTAJ3U2hPXfip BtEAnA9Ub3kkoKbFtNOcJgl7vHAi3KO2 =qlG4 -----END PGP SIGNATURE----- From bari at onelabs.com Fri Jun 6 12:14:28 2008 From: bari at onelabs.com (bari) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <9FD5E5B6-BCAF-4C00-A02B-A3678E46C77D@sanger.ac.uk> References: <48495A43.4060809@tamu.edu> <48496D2D.5080801@cse.ucdavis.edu> <87hcc6h2yi.fsf@snark.cb.piermont.com> <9FD5E5B6-BCAF-4C00-A02B-A3678E46C77D@sanger.ac.uk> Message-ID: <48498C94.1010608@onelabs.com> Tim Cutts wrote: > > Nope. :-) This is, in my view, one of the major disadvantages of PC > clusters. The crappy old BIOS that we're stuck with. > Just out of curiosity beside the clusters at LANL and Sandia who here uses coreboot (LinuxBIOS) for BIOS? http://www.coreboot.org If not, why not? Lack of vendor support? -Bari From spambox at emboss.co.nz Sat Jun 7 12:03:07 2008 From: spambox at emboss.co.nz (Michael Brown) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <87hcc6h2yi.fsf@snark.cb.piermont.com> References: <48495A43.4060809@tamu.edu> <48496D2D.5080801@cse.ucdavis.edu> <87hcc6h2yi.fsf@snark.cb.piermont.com> Message-ID: <5ECDF0CF16C448CCA570A7C5DCBDBA71@Forethought> Perry E. Metzger wrote: > Anyone have any cool tricks for how to consistently set the BIOS on > large numbers of boxes without requiring steps that humans can screw > up easily? Get a USB stick that boots into Linux. Set up one machine the way you want, then boot it up using the USB stick. 
Do:

    dd if=/dev/nvram of=cmos.bin

For each of the other machines, boot them using the stick and do:

    dd if=cmos.bin of=/dev/nvram

From johnh at streamline-computing.com Mon Jun 9 06:55:17 2008
From: johnh at streamline-computing.com (johnh@streamline-computing.com)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] A couple of interesting comments
In-Reply-To: <1213008547.8064.9.camel@bruce.priv.wark.uk.streamline-computing.com>
References: <1213008547.8064.9.camel@bruce.priv.wark.uk.streamline-computing.com>
Message-ID: <6c91fc1fbdbca86c512bb52ae3cfab4f@87.127.209.200>

> On Fri, 2008-06-06 at 10:39 -0500, Gerry Creager wrote:
>>
>
> I can think of at least one cluster where the opposite has been true and
> PXE boot has been the default. The problem with this is if the head
> node PXE boots on the customers network and gets automatically
> re-installed as a windows workstation everybody gets egg on their face.
> Yes even "modern" BIOSes are bad but localboot first is a sensible
> default.

Our clusters are set such that all compute nodes PXE boot first, then localboot. All head nodes should have the BIOS set to localboot first.

From maurice at harddata.com Mon Jun 9 13:03:58 2008
From: maurice at harddata.com (Maurice Hilarius)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] A couple of interesting comments
In-Reply-To: <200806091657.m59Gum9n021891@bluewest.scyld.com>
References: <200806091657.m59Gum9n021891@bluewest.scyld.com>
Message-ID: <484D8CAE.8000008@harddata.com>

Chris Samuel wrote:
>
> Our most recent vendor went to the motherboard manufacturer
> and said "please can you cut us a BIOS with these default
> settings" and they did so.
>
> cheers,
> Chris

Some manufacturers do, some do not. Asus, for example, do, for their OEM customers. OTOH, one may buy the BIOS customization software, from AMI, for example, for a support/licensing fee of about $10,000 per year.

-- With our best regards, //Maurice W. Hilarius Telephone: 01-780-456-9771/ /Hard Data Ltd.
FAX: 01-780-456-9772/ /11060 - 166 Avenue email:maurice@harddata.com/ /Edmonton, AB, Canada http://www.harddata.com// / T5X 1Y3/ / -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20080609/858c4f5e/attachment.html From mark.kosmowski at gmail.com Tue Jun 10 06:44:29 2008 From: mark.kosmowski at gmail.com (Mark Kosmowski) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] size of swap partition Message-ID: > Message: 5 > Date: Tue, 10 Jun 2008 00:58:12 -0400 (EDT) > From: Mark Hahn > Subject: Re: [Beowulf] size of swap partition > To: Gerry Creager > Cc: Mikhail Kuzminsky , beowulf@beowulf.org > Message-ID: > > Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed > > > We have the potential to have to swap whole jobs out of memory on a complete > > node. > > that was our intent as well. among other things, this scheme enables > running the cluster "split-personality" - mostly shorter/smaller even > interactive jobs during the day, with big/long jobs running at night. > unfortunately, you need a smart scheduler to do this, and ours is dumb. > > >> beleive, it is 2 or more GB per core; we have 16 GB per dual-socket > >> quad-core Opteron node). What is typical modern swap size today? > > are you willing to use a node which is actually occupying 16 GB of swap? > > it is possible to tune how the kernel responds to memory crunches - > for instance, you can always avoid OOM with the vm.overcommit_memory=2 > sysctl (you'll need to tune vm.overcommit_ratio and the amount of swap > to get the desired limits.) in this mode, the kernel tracks how much VM > it actually needs (worst-case, reflected in Committed_AS in /proc/meminfo) > and compares that to a commit limit that reflects ram and swap. > > if you don't use overcommit_memory=2, you are basically borrowing VM > space in hopes of not needing it. 
that can still be reasonable, considering > how often processes have a lot of shared VM, and how many processes > allocate but never touch lots of pages. but you have to ask yourself: > would I like a system that was actually _using_ 16 GB of swap? if you > have 16x disks, perhaps, but 16G will suck if you only have 1 disk. > at least for overcommit_memory != 2, I don't see the point of configuring > a lot of swap, since the only time you'd use it is if you were thrashing. > sort of a "quality of life" argument. > > >> But what are the reccomendations of modern praxis ? > > it depends a lot on the size variance of your jobs, as well as > their real/virtual ratio. the kernel only enforces RLIMIT_AS > (vsz in ps),assuming a 2.6 kernel - I forget whether 2.4 did > RLIMIT_RSS or not. > > if you use overcommit_memory=2, your desired max VM size determines > the amount of swap. otherwise, go with something modest - memory size > or so. but given that the smallest reasonable single disk these days > is probably about 320GB, it's hard to justify being _too_ tight. Is anyone using those Gigabyte i-RAM type devices from swap? Or is RAM cheaper? What about using these devices as swap to "add RAM" to older equipment that is at the maximum mobo supported RAM limit? From walid.shaari at gmail.com Tue Jun 10 09:27:43 2008 From: walid.shaari at gmail.com (Walid) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] size of swap partition In-Reply-To: References: Message-ID: Hi, For an 8GB dual socket quad core node, choosing in the kick start file --recommended instead of specifying size RHEL5 allocates 1GB of memory. 
our developers say that they should not swap as this will cause an overhead, and they try to avoid it as much as possible regards Walid On 10/06/2008, Mark Kosmowski wrote: >> Message: 5 >> Date: Tue, 10 Jun 2008 00:58:12 -0400 (EDT) >> From: Mark Hahn >> Subject: Re: [Beowulf] size of swap partition >> To: Gerry Creager >> Cc: Mikhail Kuzminsky , beowulf@beowulf.org >> Message-ID: >> >> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed >> >> > We have the potential to have to swap whole jobs out of memory on a >> > complete >> > node. >> >> that was our intent as well. among other things, this scheme enables >> running the cluster "split-personality" - mostly shorter/smaller even >> interactive jobs during the day, with big/long jobs running at night. >> unfortunately, you need a smart scheduler to do this, and ours is dumb. >> >> >> beleive, it is 2 or more GB per core; we have 16 GB per dual-socket >> >> quad-core Opteron node). What is typical modern swap size today? >> >> are you willing to use a node which is actually occupying 16 GB of swap? >> >> it is possible to tune how the kernel responds to memory crunches - >> for instance, you can always avoid OOM with the vm.overcommit_memory=2 >> sysctl (you'll need to tune vm.overcommit_ratio and the amount of swap >> to get the desired limits.) in this mode, the kernel tracks how much VM >> it actually needs (worst-case, reflected in Committed_AS in /proc/meminfo) >> and compares that to a commit limit that reflects ram and swap. >> >> if you don't use overcommit_memory=2, you are basically borrowing VM >> space in hopes of not needing it. that can still be reasonable, >> considering >> how often processes have a lot of shared VM, and how many processes >> allocate but never touch lots of pages. but you have to ask yourself: >> would I like a system that was actually _using_ 16 GB of swap? if you >> have 16x disks, perhaps, but 16G will suck if you only have 1 disk. 
>> at least for overcommit_memory != 2, I don't see the point of configuring >> a lot of swap, since the only time you'd use it is if you were thrashing. >> sort of a "quality of life" argument. >> >> >> But what are the reccomendations of modern praxis ? >> >> it depends a lot on the size variance of your jobs, as well as >> their real/virtual ratio. the kernel only enforces RLIMIT_AS >> (vsz in ps),assuming a 2.6 kernel - I forget whether 2.4 did >> RLIMIT_RSS or not. >> >> if you use overcommit_memory=2, your desired max VM size determines >> the amount of swap. otherwise, go with something modest - memory size >> or so. but given that the smallest reasonable single disk these days >> is probably about 320GB, it's hard to justify being _too_ tight. > > Is anyone using those Gigabyte i-RAM type devices from swap? Or is > RAM cheaper? What about using these devices as swap to "add RAM" to > older equipment that is at the maximum mobo supported RAM limit? > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From kus at free.net Tue Jun 10 10:35:46 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] size of swap partition In-Reply-To: Message-ID: In message from Mark Hahn (Tue, 10 Jun 2008 00:58:12 -0400 (EDT)): ... >for instance, you can always avoid OOM with the vm.overcommit_memory=2 >sysctl (you'll need to tune vm.overcommit_ratio and the amount of swap >to get the desired limits.) in this mode, the kernel tracks how much >VM >it actually needs (worst-case, reflected in Committed_AS in >/proc/meminfo) >and compares that to a commit limit that reflects ram and swap. > >if you don't use overcommit_memory=2, you are basically borrowing VM >space in hopes of not needing it. 
that can still be reasonable, >considering >how often processes have a lot of shared VM, and how many processes >allocate but never touch lots of pages. but you have to ask yourself: >would I like a system that was actually _using_ 16 GB of swap? if you >have 16x disks, perhaps, but 16G will suck if you only have 1 disk. >at least for overcommit_memory != 2, I don't see the point of >configuring >a lot of swap, since the only time you'd use it is if you were >thrashing. >sort of a "quality of life" argument. > >>> But what are the reccomendations of modern praxis ? > >it depends a lot on the size variance of your jobs, as well as their >real/virtual ratio. the kernel only enforces RLIMIT_AS >(vsz in ps),assuming a 2.6 kernel - I forget whether 2.4 did >RLIMIT_RSS or not. > >if you use overcommit_memory=2, your desired max VM size determines >the amount of swap. otherwise, go with something modest - memory size >or so. but given that the smallest reasonable single disk these days >is probably about 320GB, it's hard to justify being _too_ tight. :-) The disks we use in nodes is SATA WD/10K RPM w/70 GB :-)) We didn't set overcommit_memory=2, but really use strongly restricted scheduling police for SGE batch jobs using only few applications. We have only batch jobs (no interactive), moreover - practically only *long batch jobs*. As a result we have summary VM (requested per node) equal (or lower) than RAM. There is practically zero swap activity. The only exclusion are (seldom executed) small test jobs, non-parallelized, mainly for check of input data. They use small RAM amount. So it looks for me that I may set even lower than 1.5*RAM swap size (I think RAM+4G = 20G will be enough). In message from Walid (Tue, 10 Jun 2008 19:27:43 +0300): >Hi, >For an 8GB dual socket quad core node, choosing in the kick start >file --recommended instead of specifying size RHEL5 allocates 1GB of >memory. 
our developers say that they should not swap as this will >cause an overhead, and they try to avoid it as much as possible OpenSuSE 10.3 recommends swap size=2 GB only, but I don't know, performs SuSE inst software some estimation of server RAM or no. Yours Mikhail From csamuel at vpac.org Tue Jun 10 18:33:28 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <6c91fc1fbdbca86c512bb52ae3cfab4f@87.127.209.200> Message-ID: <1674199748.126391213148008004.JavaMail.root@zimbra.vpac.org> ----- johnh@streamline-computing.com wrote: > All head nodes should have the BIOS set to locaboot first. We set the interface on the internal cluster network to PXE and the external to not. Mind you, we control the external network too, so even if it did try it shouldn't do anything. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Tue Jun 10 18:43:01 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] Barcelona hardware error: how to detect In-Reply-To: <436509742.126541213148491810.JavaMail.root@zimbra.vpac.org> Message-ID: <457290452.126561213148581328.JavaMail.root@zimbra.vpac.org> ----- "Jason Clinton" wrote: > The kernel patch is very extensive and, last I heard, under NDA. AMD post the patches publicly to the x86-64 discuss list. The most recent ones covered 2.6.24 and 2.6.25 and were sent out in April. https://www.x86-64.org/pipermail/discuss/2008-April/010398.html -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From Dan.Kidger at quadrics.com Thu Jun 12 04:38:36 2008 From: Dan.Kidger at quadrics.com (Dan.Kidger@quadrics.com) Date: Thu Aug 28 01:07:09 2008 Subject: [Beowulf] A couple of interesting comments In-Reply-To: <1674199748.126391213148008004.JavaMail.root@zimbra.vpac.org> References: <6c91fc1fbdbca86c512bb52ae3cfab4f@87.127.209.200> <1674199748.126391213148008004.JavaMail.root@zimbra.vpac.org> Message-ID: <0D49B15ACFDF2F46BF90B6E08C90048A04884916EC@quadbrsex1.quadrics.com> Chris Samuel wrote: >----- johnh@streamline-computing.com wrote: >> All head nodes should have the BIOS set to localboot first. > >We set the interface on the internal cluster network to >PXE and the external to not. I agree. but note that if you use ROCKS, it insists on the other way round: It wants to always reinstall a node on power up *unless* it has a working bootable partition, and it deliberately trashes the boot sector on a clean boot - only replacing it if the node is shut down cleanly. The key point is that any machine that has state that should be kept (like a head node) should *never* PXE boot by default - possibly not even if the primary HDD won't boot properly - only PXE boot on human intervention. PXE booting is concept of trust - that you trust the machine upstream of you to have full control of you, to delete and reinstall whatever it wants. Daniel ------------------------------------------------------------- Dr. Daniel Kidger, Quadrics Ltd. 
daniel.kidger@quadrics.com
One Bridewell St.,       Mobile: +44 (0)779 209 1851
Bristol, BS1 2AA, UK     Office: +44 (0)117 915 5519
----------------------- www.quadrics.com --------------------

From walid.shaari at gmail.com  Thu Jun 12 06:32:38 2008
From: walid.shaari at gmail.com (Walid)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] RHEL5 network throughput/scalability
Message-ID:

Hi All,

I have an issue with a new cluster setup where the nodes run RHEL 5.1 (with the latest 5.2 kernel). When I write data over NFS, the nodes scale linearly until they reach the 10th node: the bandwidth and throughput seen from the NFS server on the other side of the nodes show a linear increase from around 100+ MByte/sec up to 1 GByte/sec. However, when we add one more node, the throughput becomes erratic and inconsistent, and drops to around 500-700 MByte/sec. If I try the same setup with RHEL4U6 I do not get the same behaviour; it sustains the bandwidth at 1 GByte/sec.

The setup is like this: 48 nodes share a 48-port access switch that is uplinked over a 10G link to a Cisco 6509 switch, which is linked to a clustered NFS file system that consists of eight heads, each linked to the 6509 by a 10G link.

The above was a write test, so I thought that maybe TCP congestion control had kicked in, or that there was a sliding-window problem. However, when I do a read test it gets worse: the scalability is now reduced to 5 nodes. That is, one node is able to read around 100 MBps, two will read double that, and so on until you add the fifth node, where the bandwidth drops from around 500+ MBps to around 300. Again, the behaviour from RHEL4 is different.

Any pointers?

TIA

Walid

From garantes at iq.usp.br  Tue Jun 10 06:53:15 2008
From: garantes at iq.usp.br (Guilherme Menegon Arantes)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] size of swap partition
In-Reply-To: <200806101313.m5ADC2t3014136@bluewest.scyld.com>
References: <200806101313.m5ADC2t3014136@bluewest.scyld.com>
Message-ID: <20080610135315.GA4894@dinamobile>

On Tue, Jun 10, 2008 at 06:13:00AM -0700, beowulf-request@beowulf.org wrote:
>
> Date: Tue, 10 Jun 2008 00:58:12 -0400 (EDT)
> From: Mark Hahn
> Subject: Re: [Beowulf] size of swap partition
> To: Gerry Creager
> Cc: Mikhail Kuzminsky , beowulf@beowulf.org
>
> it is possible to tune how the kernel responds to memory crunches -
> for instance, you can always avoid OOM with the vm.overcommit_memory=2
> sysctl (you'll need to tune vm.overcommit_ratio and the amount of swap
> to get the desired limits.) in this mode, the kernel tracks how much VM
> it actually needs (worst-case, reflected in Committed_AS in /proc/meminfo)
> and compares that to a commit limit that reflects ram and swap.
>
> ...
>
> their real/virtual ratio. the kernel only enforces RLIMIT_AS
> (vsz in ps),assuming a 2.6 kernel - I forget whether 2.4 did
> RLIMIT_RSS or not.

And that brings me to another related question: Where can I get more information about VM usage/tuning options (such as vm.overcommit_ratio, RLIMIT_AS, etc.) and VM metrics (such as vmstat) for the current Linux kernels (>= 2.6.18)?

I have looked in /usr/src/linux/Documentation/vm, but is there anywhere else with more accessible/digested info?
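The commit accounting in the quoted text can be checked against /proc/meminfo directly. Below is a minimal sketch in Python, not a definitive tool: the function names are invented for illustration, and the limit formula (swap plus vm.overcommit_ratio percent of RAM) assumes no huge pages are reserved.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(fields):
    """kB that can still be committed before Committed_AS reaches CommitLimit."""
    return fields["CommitLimit"] - fields["Committed_AS"]


def expected_commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Approximate CommitLimit: swap + RAM * ratio / 100.

    Assumption: no huge pages are reserved (the kernel subtracts them from RAM).
    """
    return swap_kb + ram_kb * overcommit_ratio // 100
```

On a live system, pass open("/proc/meminfo").read() to parse_meminfo. With vm.overcommit_memory=2, allocations begin to fail once they would push Committed_AS past CommitLimit, so the headroom figure is the number to watch.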
Regards,

Guilherme

--
Guilherme Menegon Arantes, PhD
São Paulo, Brasil
______________________________________________________

From raq at cttc.upc.edu  Tue Jun 10 10:33:30 2008
From: raq at cttc.upc.edu (Ramiro Alba Queipo)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Infiniband modular switches
Message-ID: <1213119210.8051.143.camel@mundo>

Hello everybody:

We are about to build an HPC cluster with an InfiniBand network, starting from 22 dual socket nodes with AMD quad-core processors, and in a year or so we will have about 120 nodes. We will be using InfiniBand both for computation and for storage.

The question is that we need a modular solution and we have 3 candidates:

a) Voltaire Grid Director SDR or DDR 288 ports (9988 or 2012 models) -> seems very good and well supported, but very expensive.

b) Qlogic SilverStorm 9120 (144 ports) -> no price and support information yet

c) Flextronics 10U 144 Port Modular -> very good on price but little support => risky option?

I am in a mess. What is your opinion about this matter? Are you using any of these products?

Regards

From landman at scalableinformatics.com  Thu Jun 12 07:36:32 2008
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Infiniband modular switches
In-Reply-To: <1213119210.8051.143.camel@mundo>
References: <1213119210.8051.143.camel@mundo>
Message-ID: <48513470.30308@scalableinformatics.com>

Ramiro Alba Queipo wrote:
> Hello everybody:
>
> We are about to build an HPC cluster with infiniband network starting
> from 22 dual socket nodes with AMD QUAD core processors and in a year or
> so we will be having about 120 nodes. We will be using infiniband both
> for calculation as for storage.
Hi Ramiro:

You may experience some contention issues in this case if your code is very latency sensitive, and you do lots of IO.

> The question is that we need a modular solution and we are having 3
> candidates:
>
> a) Voltaire Grid Director SDR or DDR 288 ports (9988 or 2012 models)->
> seems very good and well supported, but very expensive.
>
> b) Qlogic SilverStorm 9120 (144 ports) -> no price and support
> information yet
>
> c) Flextronics 10U 144 Port Modular-> very good at price but little
> support => risky option?.

The Flextronics units are Mellanox IP/chips inside (as are, I believe, many/most of the others). That is, the risk is low from a "will it work" view. Flextronics is an ODM, so they may not provide the levels of support around the system that you might get with Voltaire et al.

Do you want/need a 1:1 architecture (e.g. all ports are the same number of switch hops from each other), or are you able/willing to look into oversubscribed links? Part of this has to do with your traffic patterns, your code requirements on latency, and your storage bandwidth.

The Voltaire units are good, we have used them in units for customers. No complaints. Flextronics should be fine, as should Qlogic. We have customers with all of these. Rarely hear of complaints on IB switches.

>
> I am in a mess. What is your opinion about this matter? Are you using
> any of this products.
>
> Regards
>
>

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615

From djholm at fnal.gov  Thu Jun 12 08:08:21 2008
From: djholm at fnal.gov (Don Holmgren)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Infiniband modular switches
In-Reply-To: <1213119210.8051.143.camel@mundo>
References: <1213119210.8051.143.camel@mundo>
Message-ID:

Ramiro -

You might want to also consider buying just a single 24-port switch for your 22 nodes, and then when you expand either replace with a larger switch, or build a distributed switch fabric with a number of leaf switches connecting into a central spine switch (or switches). By the time you expand to the larger cluster, switches based on the announced 36-port Mellanox crossbar silicon will be available and perhaps per port prices will have dropped sufficiently to justify the purchase delay and the disruption at the time of expansion.

If your applications can tolerate some oversubscription (less than a 1:1 ratio of leaf-to-spine uplinks to leaf-to-node connections), a distributed switch fabric (leaf and spine) has the advantage of shorter (and cheaper) cables between the leaf switches and your nodes, and relatively fewer longer cables from the leaves back to the spine, compared with a single central switch.

We have many Flextronics switches - SDR and DDR, 24-port and 144-port - on a pair of large clusters (520 nodes, and 600 nodes) built in 2005 and 2006. No complaints. But, we have been self-supporting, and I would guess you would have very different support structures with Voltaire or Qlogic. With the Flextronics switches you will definitely be using the OFED stack, and you will have to run a subnet manager on one of your nodes (dedicated is probably best).
You could optionally buy an embedded subnet manager on the Voltaire or Qlogic switches, depending upon model, though I believe for a large fabric an external subnet manager is still recommended.

Don Holmgren
Fermilab

On Tue, 10 Jun 2008, Ramiro Alba Queipo wrote:

> Hello everybody:
>
> We are about to build an HPC cluster with infiniband network starting
> from 22 dual socket nodes with AMD QUAD core processors and in a year or
> so we will be having about 120 nodes. We will be using infiniband both
> for calculation as for storage.
> The question is that we need a modular solution and we are having 3
> candidates:
>
> a) Voltaire Grid Director SDR or DDR 288 ports (9988 or 2012 models)->
> seems very good and well supported, but very expensive.
>
> b) Qlogic SilverStorm 9120 (144 ports) -> no price and support
> information yet
>
> c) Flextronics 10U 144 Port Modular-> very good at price but little
> support => risky option?.
>
> I am in a mess. What is your opinion about this matter? Are you using
> any of this products.
>
> Regards

From andrew at moonet.co.uk  Thu Jun 12 08:51:11 2008
From: andrew at moonet.co.uk (andrew holway)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Infiniband modular switches
In-Reply-To:
References: <1213119210.8051.143.camel@mundo>
Message-ID:

+1 for the 24 port Flextronics switches. They are very cost effective for half bisectional networks up to 32 ports. It starts to get messy after that.

I wonder how long we will be waiting for switches based on the 36p ASIC?

On Thu, Jun 12, 2008 at 4:08 PM, Don Holmgren wrote:
>
> Ramiro -
>
> You might want to also consider buying just a single 24-port switch for your
> 22 nodes, and then when you expand either replace with a larger switch, or
> build a distributed switch fabric with a number of leaf switches connecting
> into a central spine switch (or switches).
> By the time you expand to the
> larger cluster, switches based on the announced 36-port Mellanox crossbar
> silicon will be available and perhaps per port prices will have dropped
> sufficiently to justify the purchase delay and the disruption at the time of
> expansion.
>
> If your applications can tolerate some oversubscription (less than a 1:1
> ratio of leaf-to-spine uplinks to leaf-to-node connections), a distributed
> switch fabric (leaf and spine) has the advantage of shorter (and cheaper)
> cables between the leaf switches and your nodes, and relatively fewer longer
> cables from the leaves back to the spine, compared with a single central
> switch.
>
> We have many Flextronics switches - SDR and DDR, 24-port and 144-port - on a
> pair of large clusters (520 nodes, and 600 nodes) built in 2005 and 2006. No
> complaints. But, we have been self-supporting, and I would guess you would
> have very different support structures with Voltaire or Qlogic. With the
> Flextronics switches you will definitely be using the OFED stack, and you
> will have to run a subnet manager on one of your nodes (dedicated is
> probably best). You could optionally buy an embedded subnet manager on the
> Voltaire or Qlogic switches, depending upon model, though I believe for a
> large fabric an external subnet manager is still recommended.
>
> Don Holmgren
> Fermilab
>
>
>
>
> On Tue, 10 Jun 2008, Ramiro Alba Queipo wrote:
>
>> Hello everybody:
>>
>> We are about to build an HPC cluster with infiniband network starting
>> from 22 dual socket nodes with AMD QUAD core processors and in a year or
>> so we will be having about 120 nodes. We will be using infiniband both
>> for calculation as for storage.
>> The question is that we need a modular solution and we are having 3
>> candidates:
>>
>> a) Voltaire Grid Director SDR or DDR 288 ports (9988 or 2012 models)->
>> seems very good and well supported, but very expensive.
>>
>> b) Qlogic SilverStorm 9120 (144 ports) -> no price and support
>> information yet
>>
>> c) Flextronics 10U 144 Port Modular-> very good at price but little
>> support => risky option?.
>>
>> I am in a mess. What is your opinion about this matter? Are you using
>> any of this products.
>>
>> Regards
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

From richard.walsh at comcast.net  Thu Jun 12 09:12:34 2008
From: richard.walsh at comcast.net (richard.walsh@comcast.net)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Is PowerXCell eDP fully IEEE 754 compliant ... ?? ... the old Cell is/was ...
Message-ID: <061220081612.24009.48514AF20003BC1C00005DC92215567074089C040E99D20B9D0E080C079D@comcast.net>

All,

I have not been able to get an exact answer to this question. The older chip, while much slower in double precision, was fully IEEE compliant, I am fairly sure. I believe that IBM has improved the compliance of single precision in the PowerXCell (although it is still not fully compliant), but its double precision has fallen back to the single-precision level of compliance to allow for the performance boost. This leaves it at close to parity in compliance when compared to the competition (ATI and NVIDIA) in the volume-economics DLP accelerator arena.

Could someone who is certain of the answer to this question clarify?

Best Regards,

rbw

--
"Making predictions is hard, especially about the future."
Niels Bohr
--
Richard Walsh
Thrashing River Consulting--
5605 Alameda St.
Shoreview, MN 55126
Phone #: 612-382-4620

From Shainer at mellanox.com  Thu Jun 12 10:01:11 2008
From: Shainer at mellanox.com (Gilad Shainer)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Infiniband modular switches
In-Reply-To:
Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F0129E5FB@mtiexch01.mti.com>

> +1 for the 24 port flextronics switches. They are very cost effective
> for half bisectional networks upto 32 ports. It starts to get
> messy after that.
>
> I wonder how long we will be waiting for switches based on
> the 36p asic?
>

Mellanox announced the availability of the switch ASIC this week, and can provide switch evaluation kits (36 port box and adapters with IB QDR capability) now. My estimate is that the production switches will be out in Q3.

Gilad.

From andrew at moonet.co.uk  Thu Jun 12 10:21:43 2008
From: andrew at moonet.co.uk (andrew holway)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Infiniband modular switches
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F0129E5FB@mtiexch01.mti.com>
References: <9FA59C95FFCBB34EA5E42C1A8573784F0129E5FB@mtiexch01.mti.com>
Message-ID:

> Mellanox announced the availability of the switch asic this week, and
> can provide switch evaluation kits (36 port box and adapters with IB QDR
> capability) now. My estimation is that the production switches will be
> out Q3.

Which vendor?

From bill at cse.ucdavis.edu  Thu Jun 12 11:04:19 2008
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Barcelona hardware error: how to detect
In-Reply-To: <588c11220806051046t3ad48a02q449a3ede0e299884@mail.gmail.com>
References: <588c11220806051046t3ad48a02q449a3ede0e299884@mail.gmail.com>
Message-ID: <48516523.9010406@cse.ucdavis.edu>

> Yes, what are currently shipping from AMD are B3 revision processors. The
> TLB-look-aside problem is fixed.
>
> There are other less-critical problems with B3, however.
> Specifically,
> power-related compatibility issues with various motherboards due to
> (according to the motherboard manufacturers) AMD changing the TDP late in
> the release process. I can't give any specific names or models that we know
> have problems, however. I can say that everyone involved is working on a
> resolution--usually through PCB revisions of the motherboards. A number of
> 1U power supplies that have previously worked with all Intel and AMD
> solutions are now insufficient, as well, due to 12V limitations. B3 pulls a
> *lot* of power.

I've heard reports of B3 pulling more power than B2; I'm not sure if that's just the higher clock speed or a B3-related change.

Has anyone put a dual socket B3 system on a Kill A Watt and tested it under load?

From bernard at vanhpc.org  Thu Jun 12 12:45:02 2008
From: bernard at vanhpc.org (Bernard Li)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Roadrunner picture
Message-ID:

Hi all:

I am sure most people have seen the following picture of Roadrunner circulating the Net:

http://www.cnn.com/2008/TECH/06/09/fastest.computer.ap/index.html?iref=newssearch

However, they don't look like blades to me, more like 2U IBM x series servers. Perhaps those are the I/O nodes?

Cheers,

Bernard

From prentice at ias.edu  Thu Jun 12 13:03:07 2008
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] User resource limits
In-Reply-To: <742070125.116971213070188917.JavaMail.root@zimbra.vpac.org>
References: <742070125.116971213070188917.JavaMail.root@zimbra.vpac.org>
Message-ID: <485180FB.2090402@ias.edu>

Chris Samuel wrote:
>
> Unfortunately the kernel implementation of mmap() doesn't check
> the maximum memory size (RLIMIT_RSS) or maximum data size (RLIMIT_DATA)
> limits which were being set, but only the maximum virtual RAM size
> (RLIMIT_AS) - this is documented in the setrlimit(2) man page.
>
> :-(
>

Yeah... I was just reading the setrlimit man page.
Does that mean that the only way I can limit RAM usage is with RLIMIT_AS? (Or "as" in limits.conf parlance.)

I would have to limit AS < RAM to keep a user from using all RAM. Since AS includes virtual memory, and VM = RAM + swap, wouldn't I be limiting users a little more than I'd hoped?

--
Prentice

From prentice at ias.edu  Thu Jun 12 13:07:18 2008
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Roadrunner picture
In-Reply-To:
References:
Message-ID: <485181F6.8030904@ias.edu>

Bernard Li wrote:
> Hi all:
>
> I am sure most people have seen the following picture for Roadrunner
> circulating the Net:
>
> http://www.cnn.com/2008/TECH/06/09/fastest.computer.ap/index.html?iref=newssearch
>
> However, they don't look likes blades to me, more like 2U IBM x series
> servers. Perhaps those are the I/O nodes?
>

Perhaps that is a poorly chosen file photo, and not really a photo of Roadrunner? Your I/O node theory is plausible, too.

Is IBM getting away from the BlueGene architecture?

From john.leidel at gmail.com  Thu Jun 12 13:09:32 2008
From: john.leidel at gmail.com (John Leidel)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Roadrunner picture
In-Reply-To:
References:
Message-ID: <1213301372.5092.0.camel@e521.site>

Also at ComputerWorld:

http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9085021&intsrc=news_ts_head

On Thu, 2008-06-12 at 12:45 -0700, Bernard Li wrote:
> Hi all:
>
> I am sure most people have seen the following picture for Roadrunner
> circulating the Net:
>
> http://www.cnn.com/2008/TECH/06/09/fastest.computer.ap/index.html?iref=newssearch
>
> However, they don't look likes blades to me, more like 2U IBM x series
> servers. Perhaps those are the I/O nodes?
>
> Cheers,
>
> Bernard

From peter.st.john at gmail.com  Thu Jun 12 13:09:43 2008
From: peter.st.john at gmail.com (Peter St. John)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Roadrunner picture
In-Reply-To:
References:
Message-ID:

Bernard,

I'm looking forward to hearing from our resident experts, but meanwhile:

http://en.wikipedia.org/wiki/IBM_Roadrunner

explains the architecture somewhat. The buzzword is "triblade", which is 3 blades (with an extension) employing two types of processors (AMD Opteron and IBM Cell) in a hybrid subsystem. I have no idea what a single triblade looks like. The overall machine is then composed of zillions of triblades.

Wow, imagine a Beowulf of those (jk :-)

Peter (designing a Beowulf of abaci to fit his current budget)

On Thu, Jun 12, 2008 at 3:45 PM, Bernard Li wrote:

> Hi all:
>
> I am sure most people have seen the following picture for Roadrunner
> circulating the Net:
>
>
> http://www.cnn.com/2008/TECH/06/09/fastest.computer.ap/index.html?iref=newssearch
>
> However, they don't look likes blades to me, more like 2U IBM x series
> servers. Perhaps those are the I/O nodes?
>
> Cheers,
>
> Bernard

From prentice at ias.edu  Thu Jun 12 13:58:54 2008
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Roadrunner picture
In-Reply-To:
References:
Message-ID: <48518E0E.8080604@ias.edu>

Bernard Li wrote:
> Hi all:
>
> I am sure most people have seen the following picture for Roadrunner
> circulating the Net:
>
> http://www.cnn.com/2008/TECH/06/09/fastest.computer.ap/index.html?iref=newssearch
>
> However, they don't look likes blades to me, more like 2U IBM x series
> servers. Perhaps those are the I/O nodes?
>

This might be what you're seeing:

"Each CU also has access to the Panasas file system through twelve System x3755 machines."

- http://en.wikipedia.org/wiki/IBM_Roadrunner

From richard.walsh at comcast.net  Thu Jun 12 14:05:09 2008
From: richard.walsh at comcast.net (richard.walsh@comcast.net)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] Roadrunner picture
Message-ID: <061220082105.9640.48518F850002D8B0000025A82215586394089C040E99D20B9D0E080C079D@comcast.net>

Skipped content of type multipart/alternative

From jan.heichler at gmx.net  Thu Jun 12 14:27:56 2008
From: jan.heichler at gmx.net (Jan Heichler)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] MVAPICH2 and osu_latency
Message-ID: <6410235104.20080612232756@gmx.net>

Dear all!

I found this

http://mvapich.cse.ohio-state.edu/performance/mvapich2/opteron/MVAPICH2-opteron-gen2-DDR.shtml

as a reference value for the MPI latency of InfiniBand.
I am trying to reproduce those numbers at the moment, but I'm stuck with

# OSU MPI Latency Test v3.0
# Size          Latency (us)
0               3.07
1               3.17
2               3.16
4               3.15
8               3.19

The equipment is two quad-socket Opteron blades (Supermicro) with Mellanox Ex DDR cards. A single 24-port switch connects them.

Can anybody suggest what I can do to lower the latency?

Regards,
Jan

From tom.elken at qlogic.com  Thu Jun 12 15:04:29 2008
From: tom.elken at qlogic.com (Tom Elken)
Date: Thu Aug 28 01:07:09 2008
Subject: [Beowulf] MVAPICH2 and osu_latency
In-Reply-To: <6410235104.20080612232756@gmx.net>
References: <6410235104.20080612232756@gmx.net>
Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0214C01D@AVEXCH1.qlogic.org>

So you're concerned with the gap between the 2.63 us that OSU measured and the 3.07 us you measured. I wouldn't be too concerned.

MPI latency can be quite dependent on the systems you use. OSU used dual-processor systems with 2.8 GHz processors. Such a system has ~60 ns latency to local memory. On your 4-socket Opteron system, your local memory latency is probably in the 90-100 ns range. Assuming you are also using MVAPICH2, this is probably the main reason for the latency shortfall you are seeing.

Another possibility is that the CPU you are running the MPI test on is not the closest CPU to the PCIe chipset. Thus, you may be taking some HT hops on the way to the PCIe bus and adapter card.

-Tom

________________________________
From: beowulf-bounces@beowulf.org [mailto:beo