From mathog at caltech.edu Fri Aug 1 15:13:39 2014 From: mathog at caltech.edu (mathog) Date: Fri, 01 Aug 2014 15:13:39 -0700 Subject: [Beowulf] =?utf-8?q?LSI_Megaraid_stalls_system_on_very_high_IO=3F?= In-Reply-To: References: Message-ID: <925442474713105c68da1937d55920a5@saf.bio.caltech.edu> On 31-Jul-2014 12:00, beowulf-request at beowulf.org wrote: > > Switch them to deadline or noop. > > echo deadline > /sys/block/sda/queue/scheduler > > rinse and repeat for other devices. That may fix the problem but I am still curious about exactly what happened during the terminal stalls. Using the current cfq scheduler what is going on in the system that is keeping those terminal processes from getting any CPU time? These wouldn't respond to a carriage return or an X11 redraw event, and that interaction should have been entirely within bash and other programs already in memory, so there should not be any contention for disk IO there. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From hahn at mcmaster.ca Fri Aug 1 19:42:09 2014 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 1 Aug 2014 22:42:09 -0400 (EDT) Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: <925442474713105c68da1937d55920a5@saf.bio.caltech.edu> References: <925442474713105c68da1937d55920a5@saf.bio.caltech.edu> Message-ID: > event, and that interaction should have been entirely within bash and other > programs already in memory, so there should not be any contention for disk IO > there. isn't that rather hard to tell? shared library page faulted in, for instance. any swapins happening? it could be there's some unfortunate path involving memory contention (since doing an IO benchmark is a good workout for memory management and scavenging.) (I guess you could reduce the chances of the latter by running the benchmark with O_DIRECT.) regards, mark hahn. From samuel at unimelb.edu.au Fri Aug 1 20:17:57 2014 From: samuel at unimelb.edu.au (Chris Samuel) Date: Sat, 02 Aug 2014 13:17:57 +1000 Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: <925442474713105c68da1937d55920a5@saf.bio.caltech.edu> References: <925442474713105c68da1937d55920a5@saf.bio.caltech.edu> Message-ID: <4589120.sE9eJnx7Bk@quad> On Fri, 1 Aug 2014 03:13:39 PM mathog wrote: > That may fix the problem but I am still curious about exactly what > happened during the terminal stalls. Using the current cfq scheduler > what is going on in the system that is keeping those terminal processes > from getting any CPU time? Not entirely sure, but you may get some idea with latencytop *if* it is able to get some time to run to return info. cheers, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From hearnsj at googlemail.com Fri Aug 1 23:10:21 2014 From: hearnsj at googlemail.com (John Hearns) Date: Sat, 2 Aug 2014 07:10:21 +0100 Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: <4589120.sE9eJnx7Bk@quad> References: <925442474713105c68da1937d55920a5@saf.bio.caltech.edu> <4589120.sE9eJnx7Bk@quad> Message-ID: In addition to the suggestions above I would look at the memory usage pattern while the IO is going on watch cat /proc/meminfo This system might have a decent amount of memory - but are the sysctl tunings for the dirty buffer sizes set small or something? 
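Something along these lines is a quick way to see where those knobs sit while the gunzips are running (only a sketch - the device names and a RHEL-style sysctl layout are assumed, adjust sda/sd? as needed):

# which elevator each disk is using, and the current writeback tunables
cat /sys/block/sd?/queue/scheduler
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_writeback_centisecs

# watch dirty/writeback pages grow while the IO load is applied
watch -n1 'grep -E "^(Dirty|Writeback)" /proc/meminfo'

If Dirty climbs into the tens of gigabytes before writeback catches up, the stalls are more likely a writeback tuning issue than the controller itself.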
On 2 August 2014 04:17, Chris Samuel wrote: > On Fri, 1 Aug 2014 03:13:39 PM mathog wrote: > > > That may fix the problem but I am still curious about exactly what > > happened during the terminal stalls. Using the current cfq scheduler > > what is going on in the system that is keeping those terminal processes > > from getting any CPU time? > > Not entirely sure, but you may get some idea with latencytop *if* it is > able > to get some time to run to return info. > > cheers, > Chris > -- > Christopher Samuel Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.org.au/ http://twitter.com/vlsci > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Sat Aug 2 08:14:40 2014 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 2 Aug 2014 11:14:40 -0400 (EDT) Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: References: <925442474713105c68da1937d55920a5@saf.bio.caltech.edu> <4589120.sE9eJnx7Bk@quad> Message-ID: > This system might have a decent amount of memory - but are the sysctl > tunings for the dirty buffer sizes set small or something? actually, my experience is the opposite: if sysctls permit large accumulation of dirty buffers, a system with big memory and slow write speeds will feel unpleasantly choppy. basically, I think you should set vm.dirty_ratio less than the amount your storage system can commit to disk in O(few seconds). that's the synchronous form of writeback - I also like to make the async form more active as well. (lower vm.dirty_background_ratio and vm.dirty_writeback_centisecs=100). of course, this is only relevant to big-memory-slow-disk machines. (but consider a hypothetical 1TB box with a single disk and default settings: if it dirties 100G, the sync writeback will kick in and saturate the disk for up to 10 minutes...) regards, mark hahn. From hearnsj at googlemail.com Mon Aug 4 01:24:49 2014 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 4 Aug 2014 09:24:49 +0100 Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: References: <925442474713105c68da1937d55920a5@saf.bio.caltech.edu> <4589120.sE9eJnx7Bk@quad> Message-ID: Mark, you are of course correct. Flush often and flush early! As an aside, working with desktop systems with larger amounts of memory I would adjust the 'swappiness' tunable and also the min_free_kbytes. Min_free_kbytes in Linux is by default set very low for modern high memory systems. I had systems with 128Gbytes of RAM which would lock up in a similar fashion as you describe. Setting higher min_free_kbytes helped with the 'system paging itself into the deck' type of behaviour. See: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html On 2 August 2014 16:14, Mark Hahn wrote: > This system might have a decent amount of memory - but are the sysctl >> tunings for the dirty buffer sizes set small or something? >> > > actually, my experience is the opposite: if sysctls permit large > accumulation of dirty buffers, a system with big memory and slow > write speeds will feel unpleasantly choppy. 
basically, I think you should > set vm.dirty_ratio less than the amount your storage system can commit to > disk in O(few seconds). that's the synchronous form of writeback - I also > like to make the async form more active as well. > (lower vm.dirty_background_ratio and vm.dirty_writeback_centisecs=100). > > of course, this is only relevant to big-memory-slow-disk machines. > (but consider a hypothetical 1TB box with a single disk and default > settings: if it dirties 100G, the sync writeback will kick in and saturate > the disk for up to 10 minutes...) > > regards, mark hahn. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaquilina at eagleeyet.net Mon Aug 4 08:56:29 2014 From: jaquilina at eagleeyet.net (jaquilina) Date: Mon, 04 Aug 2014 17:56:29 +0200 Subject: [Beowulf] distcc Message-ID: <88221fa6452454a3f2676caab430e858@eagleeyet.net> I seem to have lost the thread about this discussion. My apologies about reviving this old thread, but what is the difference between distcc and icecream? -- Regards, Jonathan Aquilina Founder Eagle Eye T From hahn at mcmaster.ca Mon Aug 4 09:43:09 2014 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 4 Aug 2014 12:43:09 -0400 (EDT) Subject: [Beowulf] distcc In-Reply-To: <88221fa6452454a3f2676caab430e858@eagleeyet.net> References: <88221fa6452454a3f2676caab430e858@eagleeyet.net> Message-ID: > My apologies about reviving this old thread, but what is the difference > between distcc and icecream? says it pretty clearly: https://en.opensuse.org/Icecream From skylar.thompson at gmail.com Tue Aug 5 20:49:01 2014 From: skylar.thompson at gmail.com (Skylar Thompson) Date: Tue, 05 Aug 2014 20:49:01 -0700 Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: References: <925442474713105c68da1937d55920a5@saf.bio.caltech.edu> <4589120.sE9eJnx7Bk@quad> Message-ID: <53E1A5AD.9@gmail.com> On 08/04/2014 01:24 AM, John Hearns wrote: > Mark, you are of course correct. > Flush often and flush early! > > As an aside, working with desktop systems with larger amounts of memory > I would adjust the 'swappiness' tunable > and also the min_free_kbytes. > Min_free_kbytes in Linux is by default set very low for modern high > memory systems. > I had systems with 128Gbytes of RAM which would lock up in a similar > fashion as you describe. Setting higher min_free_kbytes helped with the > 'system paging itself into the deck' type of behaviour. > See: > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html Agreed, and even more so when you have systems with hundreds of gigabytes of RAM whose dirty buffers are backed by a single NFS server with a fraction of that RAM for write cache. Skylar From hahn at mcmaster.ca Wed Aug 6 13:26:12 2014 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 6 Aug 2014 16:26:12 -0400 (EDT) Subject: [Beowulf] interesting paper re power efficiency Message-ID: http://arxiv.org/pdf/1407.8116v1.pdf in short, sub-threshold leakage is strongly temperature-dependent, so you can make conventional systems run more efficiently just by running them cold, at lower voltage and higher clock. I imagine this must be pretty strongly dependent on the details of the chip: fab-related parameters like gate dimensions, whether it's bulk or SOI, etc. paper mentions warm-water cooling, which is indeed an interesting angle. 
aquasar, for instance, is mainly aimed at reducing the delta-t (though lower thermal resistance) while keeping chips near their max operating temperature. 65 degree water is somewhat useful, but the GPU paper shows gflops/w still improving at 30 (which would correspond with an outgoing water temp of probably around 20.) I guess novec is available at various boiling points, for people who want to use immersion (though one could also manipulate pressure, I suppose...) regards, mark hahn. From eugen at leitl.org Thu Aug 7 07:51:29 2014 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 7 Aug 2014 16:51:29 +0200 Subject: [Beowulf] CUDA course at HLRS, Oct. 22-24, 2014 Message-ID: <20140807145129.GU26986@leitl.org> ----- Forwarded message from Rolf Rabenseifner ----- Date: Thu, 7 Aug 2014 15:55:43 +0200 (CEST) From: Rolf Rabenseifner To: eugen at leitl.org Subject: CUDA course at HLRS, Oct. 22-24, 2014 Reply-To: Rolf Rabenseifner , Gabi Kallenberger Message-Id: <20140807135543.47E5889C55 at awsrr.hlrs.de> Dear Sir, dear Madam / Sehr geehrte Dame, sehr geehrter Herr, Please, can you pass this course announcement also to interested colleagues. / Es waere schoen, wenn Sie diese Ankuendigung auch an interessierte Kollegen weitergeben koennten. http://www.hlrs.de/training/2014/CUDA2 Kind regards / Mit freundlichen Gruessen Rolf Rabenseifner and Gabi Kallenberger PS: Other course in the same week: Oct. 20-21, Scientific Visualization ====================================================================== Call for Participation ====================================================================== GPU Programming using CUDA -------------------------- Wednesday-Friday, Oct. 22-24, 2014 HLRS University of Stuttgart Germany Abstract: The course provides an introduction to the programming language CUDA which is used to write fast numeric algorithms for NVIDIA graphics processors (GPUs). Focus is on the basic usage of the language, the exploitation of the most important features of the device (massive parallel computation, shared memory, texture memory) and efficient usage of the hardware to maximize performance. An overview of the available development tools and the advanced features of the language is given. Date & Location: Oct. 22, 12:30 - Oct. 24, 16:00 (Course 2014-CUDA2) HLRS, seminar room, Allmandring 30, 70569 Stuttgart, Germany. Registration and further information: http://www.hlrs.de/training/2014/CUDA2 Deadline for registration: Sep. 21, 2014 (Course 2014-CUDA2) Registration fee: Students without Diploma/Master: 30 EUR Students with Diploma/Master (PhD students) at German universities: 60 EUR Members of German universities and public research institutes: 60 EUR Members of universities and public research institutes within Europe or PRACE: 60 EUR Members of other universities and public research institutes: 120 EUR others: 400 EUR (includes food and drink at coffee breaks, will be collected on the first day of the course, cash only) Lecturer: Amer Wafai and Thomas Baumann, HLRS The course language is English. We will provide local systems with test accounts for the participants. Further information on CUDA: http://en.wikipedia.org/wiki/CUDA Further courses can be found at http://www.hlrs.de/training/course-list --------------------------------------------------------------------- I would appreciate if you could forward this email to interested colleagues. 
If you receive double postings or you want to stop these postings, then please reply the email-address(es) that should be unsubscribed, and include "unsubscribe" in the subject line. --------------------------------------------------------------------- --------------------------------------------------------------------- Dr. Rolf Rabenseifner .. . . . . . . . . . email rabenseifner at hlrs.de Gabi Kallenberger .. . . . . . . . . . . . email kallenberger at hlrs.de High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 University of Stuttgart .. . . . . . . . . fax : ++49(0)711/685-65832 Head of Dpmt Parallel Computing .. .. www.hlrs.de/people/rabenseifner Nobelstr. 19, D-70569 Stuttgart, Germany . . . . (Office: Room 1.307) --------------------------------------------------------------------- ----- End forwarded message ----- From walid.shaari at gmail.com Thu Aug 7 10:48:27 2014 From: walid.shaari at gmail.com (Walid) Date: Thu, 7 Aug 2014 20:48:27 +0300 Subject: [Beowulf] ELK query "Elasticsearch, logstash and Kibana" Message-ID: Hi, I am interested to see if any one is already using ELK for HPC related logs. i would like to know more about what metrics, information, reports they are generating and logstash patterns, filters, any groks used. I have just started working on this starting with the configuration management side of systems, however i see huge potential for people who can not afford to have splunk, I am thinking scheduler, interconnect, GPFS, provisioning, system logs, job tractability, metrics from accounting file, profiling, better failure management and visualisation. kind regards Walid -------------- next part -------------- An HTML attachment was scrubbed... URL: From hvidal at tesseract-tech.com Thu Aug 7 11:19:35 2014 From: hvidal at tesseract-tech.com (H. Vidal, Jr.) Date: Thu, 7 Aug 2014 14:19:35 -0400 Subject: [Beowulf] on powering off and its character Message-ID: All, Just curious if anyone here among those well-informed about the nature of electrical power might know about documented or canonical analysis of AC power as it's being turned off...... That is, is there existing and/or standard analysis or characterization of typical AC power as power is removed? Does it ring, or spike, or otherwise modulate in understood or at least studied ways? Will google it as well, one of our engineers tried GIYF but came up short. Thanks. H. Vidal, Jr. Tesseract Technology From peter.st.john at gmail.com Thu Aug 7 14:00:36 2014 From: peter.st.john at gmail.com (Peter St. John) Date: Thu, 7 Aug 2014 17:00:36 -0400 Subject: [Beowulf] on powering off and its character In-Reply-To: References: Message-ID: There are EE professionals on this list, and I"m certainly not one of them (little tiny billiard balls in narrow pipes?) but I just want to say that it can be worthwhile googling even after someone else gives up, because the second pair of eyes can stumble on a better keyword or other regexp. So I came up with http://lifehacker.com/5526542/how-much-battery-life-does-sleep-mode-really-drain which at least is on the topic, but not for the equipment you probably have in mind. Peter On Thu, Aug 7, 2014 at 2:19 PM, H. Vidal, Jr. wrote: > All, > Just curious if anyone here among those well-informed about the nature > of electrical power might know about documented or canonical analysis > of AC power as it's being turned off...... > > That is, is there existing and/or standard analysis or characterization of > typical AC power as power is removed? 
Does it ring, or spike, or otherwise > modulate in understood or at least studied ways? > > Will google it as well, one of our engineers tried GIYF but came up short. > > Thanks. > > H. Vidal, Jr. > Tesseract Technology > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kevin at kevino.org Thu Aug 7 12:35:43 2014 From: kevin at kevino.org (KevinO) Date: Thu, 07 Aug 2014 12:35:43 -0700 Subject: [Beowulf] on powering off and its character In-Reply-To: References: Message-ID: <53E3D50F.7040802@kevino.org> On 08/07/2014 11:19 AM, H. Vidal, Jr. wrote: > All, > Just curious if anyone here among those well-informed about the nature > of electrical power might know about documented or canonical analysis > of AC power as it's being turned off...... > > That is, is there existing and/or standard analysis or characterization of > typical AC power as power is removed? Does it ring, or spike, or otherwise > modulate in understood or at least studied ways? > This is very well understood, studied, and documented for well over a century, by Engineers and Physicists. Any shift in the amplitude (voltage) due to a circuit being opened is due to the non-zero and complex impedance of the circuit supplying the power and the load. (The voltage will not jump 'just because') Maxwell's equations provide an accurate way to analyze to relationship of the current through an inductor to both the value of the inductor (the inductance) and the rate of change of the voltage across the inductor. (and the voltage across an inductor as it relates to the amount of inductance and the rate-of-change of the current). In a nutshell: When the rate-of-change of the current through an inductor approaches infinity (as when a switch in series is opened) the voltage across the inductor will approach infinity. (This is limited by circuit capacitance and resistance that can absorb some of the energy of the 'spike' caused by the rapid collapse of the magnetic field of the inductor, or by arcing and high voltage breakdown of the insulators in the circuit) Simple wires have a small amount of inductance. Coils and transformers have a lot. RC circuits (composed of series combinations of resistors and capacitors) are often placed across switch contacts or coil windings to absorb this energy in a controlled way. -- KevinO From james.p.lux at jpl.nasa.gov Thu Aug 7 21:09:35 2014 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 8 Aug 2014 04:09:35 +0000 Subject: [Beowulf] on powering off and its character In-Reply-To: References: Message-ID: Are you wondering about power turn on/off locally, or when the utility shuts you down? And are you interested in the voltage/current upstream or downstream of the switch? The power lines (of whatever length) are a transmission line with distributed R, L, and C, and it?s carrying an AC current, so when you change something, there?s a transient that propagates away from the change. On the power lines from Northern California to Southern California, for instance, a switching transient can take more than 8 hours to die out, as it propagates around, bouncing back and forth. James Lux, P.E. Task Manager, FINDER ? 
Finding Individuals for Disaster and Emergency Response Co-Principal Investigator, SCaN Testbed (n?e CoNNeCT) Project Jet Propulsion Laboratory 4800 Oak Grove Drive, MS 161-213 Pasadena CA 91109 +1(818)354-2075 +1(818)395-2714 (cell) On 8/7/14, 11:19 AM, "H. Vidal, Jr." wrote: >All, > Just curious if anyone here among those well-informed about the nature >of electrical power might know about documented or canonical analysis >of AC power as it's being turned off...... > > That is, is there existing and/or standard analysis or characterization >of >typical AC power as power is removed? Does it ring, or spike, or otherwise >modulate in understood or at least studied ways? > > Will google it as well, one of our engineers tried GIYF but came up >short. > >Thanks. > >H. Vidal, Jr. >Tesseract Technology > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Fri Aug 8 12:53:16 2014 From: mathog at caltech.edu (mathog) Date: Fri, 08 Aug 2014 12:53:16 -0700 Subject: [Beowulf] on powering off and its character In-Reply-To: References: Message-ID: <89fbc2dc27dcc73b960f43a80237b506@saf.bio.caltech.edu> On 08-Aug-2014 12:00, Lux, Jim (337C) wrote: > On the power lines from Northern California to Southern California, for > instance, a switching transient can take more than 8 hours to die out, > as > it propagates around, bouncing back and forth. Large water lines have somewhat similar issues. It took them 3 hours to shut off the recent water main break near UCLA not because they couldn't turn the shutoff valve any faster, but because if they had done so the resulting pressure spikes going back up the pipe would have blown up other water lines. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From matthurd at acm.org Sat Aug 9 17:16:49 2014 From: matthurd at acm.org (Matt Hurd) Date: Sat, 9 Aug 2014 19:16:49 -0500 Subject: [Beowulf] ELK query "Elasticsearch, logstash and Kibana" In-Reply-To: References: Message-ID: Can it do better than millisecond time stamps yet? My network stuff needs nanoseconds or better... --Matt. On 7 August 2014 12:48, Walid wrote: > Hi, > > I am interested to see if any one is already using ELK for HPC related > logs. i would like to know more about what metrics, information, reports > they are generating and logstash patterns, filters, any groks used. I have > just started working on this starting with the configuration management > side of systems, however i see huge potential for people who can not afford > to have splunk, I am thinking scheduler, interconnect, GPFS, provisioning, > system logs, job tractability, metrics from accounting file, profiling, > better failure management and visualisation. > > kind regards > > Walid > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From james.p.lux at jpl.nasa.gov Sat Aug 9 21:19:32 2014 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Sun, 10 Aug 2014 04:19:32 +0000 Subject: [Beowulf] ELK query "Elasticsearch, logstash and Kibana" In-Reply-To: References: Message-ID: Nanoseconds? 
You need something like a GPS (run of the mill is good to 10-50 ns) or IEEE-1588 timing distribution (Precision Time Protocol) If you have accurate 1pps to appropriate hardware on your nodes, then getting (much better) than sub-microsecond timing is more a matter of software than the hardware. James Lux, P.E. Task Manager, FINDER ? Finding Individuals for Disaster and Emergency Response Co-Principal Investigator, SCaN Testbed (n?e CoNNeCT) Project Jet Propulsion Laboratory 4800 Oak Grove Drive, MS 161-213 Pasadena CA 91109 +1(818)354-2075 +1(818)395-2714 (cell) From: Matt Hurd > Date: Saturday, August 9, 2014 at 5:16 PM To: Walid > Cc: "beowulf at beowulf.org" > Subject: Re: [Beowulf] ELK query "Elasticsearch, logstash and Kibana" Can it do better than millisecond time stamps yet? My network stuff needs nanoseconds or better... --Matt. On 7 August 2014 12:48, Walid > wrote: Hi, I am interested to see if any one is already using ELK for HPC related logs. i would like to know more about what metrics, information, reports they are generating and logstash patterns, filters, any groks used. I have just started working on this starting with the configuration management side of systems, however i see huge potential for people who can not afford to have splunk, I am thinking scheduler, interconnect, GPFS, provisioning, system logs, job tractability, metrics from accounting file, profiling, better failure management and visualisation. kind regards Walid _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthurd at acm.org Sat Aug 9 23:07:38 2014 From: matthurd at acm.org (Matt Hurd) Date: Sun, 10 Aug 2014 16:07:38 +1000 Subject: [Beowulf] ELK query "Elasticsearch, logstash and Kibana" In-Reply-To: References: Message-ID: Apologies, you may have misunderstood. I have nanoseconds via network captures (in pcap-ng etc) but the ELK stack AFAIK only copes with milliseconds which makes it problematic. ELK looks lovely, but without nanosecond support it doesn't work for many people with similar needs to myself. Though I may be missing something and it has nanosecond support hidden in there somewhere which is what I was asking about. Kind regards, --Matt. On 10 August 2014 14:19, Lux, Jim (337C) wrote: > Nanoseconds? > You need something like a GPS (run of the mill is good to 10-50 ns) or > IEEE-1588 timing distribution (Precision Time Protocol) > > If you have accurate 1pps to appropriate hardware on your nodes, then > getting (much better) than sub-microsecond timing is more a matter of > software than the hardware. > > James Lux, P.E. > > Task Manager, FINDER ? Finding Individuals for Disaster and Emergency > Response > > Co-Principal Investigator, SCaN Testbed (*n?e* CoNNeCT) Project > > Jet Propulsion Laboratory > > 4800 Oak Grove Drive, MS 161-213 > > Pasadena CA 91109 > > +1(818)354-2075 > > +1(818)395-2714 (cell) > > > > From: Matt Hurd > Date: Saturday, August 9, 2014 at 5:16 PM > To: Walid > Cc: "beowulf at beowulf.org" > Subject: Re: [Beowulf] ELK query "Elasticsearch, logstash and Kibana" > > Can it do better than millisecond time stamps yet? My network stuff > needs nanoseconds or better... > > --Matt. 
> > > On 7 August 2014 12:48, Walid wrote: > >> Hi, >> >> I am interested to see if any one is already using ELK for HPC related >> logs. i would like to know more about what metrics, information, reports >> they are generating and logstash patterns, filters, any groks used. I have >> just started working on this starting with the configuration management >> side of systems, however i see huge potential for people who can not afford >> to have splunk, I am thinking scheduler, interconnect, GPFS, provisioning, >> system logs, job tractability, metrics from accounting file, profiling, >> better failure management and visualisation. >> >> kind regards >> >> Walid >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bill at cse.ucdavis.edu Tue Aug 12 00:05:02 2014 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 12 Aug 2014 00:05:02 -0700 Subject: [Beowulf] Nvidia K1 Denver In-Reply-To: References: Message-ID: <53E9BC9E.6050308@cse.ucdavis.edu> I was surprised to find the Nvidia K1 to be a surprising departure from the ARM Cortex a53 and a57 cores. Summary at: http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/ Details at (if you are willing to share your email address): http://www.tiriasresearch.com/downloads/nvidia-charts-its-own-path-to-armv8/ Highlights: * The nvidia 64-bit denver core is in order! * dynamics code optimizer uses 128MB of ram to optimize frequently used code segments * Larger 128KB L1 I cache to handle microcode expansion * L1 I cache can deliver a 32 byte parcel to the scheduler every cycle * Optimizer is not visible to OS or hypervisor * 7 way issue * 13 cycle mispredict. * expected launch at 2.5 GHz * Slightly better than haswell celeron 2955U at SpecINT 2k * Slightly worse than haswell celeron 2955U at SpecFP 2k * new lower power state CC4 that allows maintaining cache and CPU state information that looks to be around 5mw from the graph. * special optimizations lookup table, 1k entries, jumps to the already optimized code. * 128MB cache does not contain any pre-canned optimizations for benchmarks. *chuckle* * Pin compatible with existing 32 bit K1 chips. The dynamics code optimization: * collects branch results (taken, not taken, strongly take, and strongly not taken). * performs register renaming * claimed comparable performance to OoO hardware implementations * claimed power efficiency of in order implementations * can reorder load/stores * remove redundant code * hoist redundant computations * unroll loops * claims larger instruction reorder window than hardware implementations Seems like the optimizer has a pretty tough job considering that compilers have already attempted similar optimizations with access to source code and relatively unlimited CPU/ram resources compared to a battery operated tablet/phone/widget. For reference, celeron 2955U = 2 cores @ 1.4GHz, 2MB cache, 15 watt TDP, haswell core, 25.6GB/sec mem bandwidth. 
From doug.lattman at L-3com.com Tue Aug 12 09:01:58 2014
From: doug.lattman at L-3com.com (doug.lattman at L-3com.com)
Date: Tue, 12 Aug 2014 16:01:58 +0000
Subject: [Beowulf] 8p 16 core x86_64 systems
Message-ID: <604CF06B57C0DD4391CCB99DF7F36A4008A012@EXCHANGE.CAC.L-3com.com>

Does anyone know of any manufacturers who build an 8-processor (8-way) motherboard which can utilize 16-core Opteron chips? Tyan used to build the Transport VX50 (B4985-E); however, it uses the older chipset, which only goes up to 4 cores per processor. Supermicro makes a 4P board with 16-core processors, for a total of 64 cores per board. I am just trying to reduce the footprint of all the cores in my cluster, and was hoping to find a manufacturer who has built an 8P board with 16 cores per processor, yielding 128 cores per board, if not more.
> Thanks > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From cbergstrom at pathscale.com Tue Aug 12 10:50:31 2014 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Wed, 13 Aug 2014 00:50:31 +0700 Subject: [Beowulf] 8p 16 core x86_64 systems In-Reply-To: <237sHLq543040S06.1407862675@web06.cms.usa.net> References: <237sHLq543040S06.1407862675@web06.cms.usa.net> Message-ID: <53EA53E7.3000408@pathscale.com> On 08/12/14 11:57 PM, Joshua Mora wrote: > Hello Doug. > AMD CPUs are connected together with a coherent fabric named Hypertransport. > It allows to build shared memory system with coherency upto 8 numanodes. > Currently, AMD CPUs have 1 numanode or 2 numanodes. The first is in 1 > package/socket named C32 and the second in a 1 package/socket named G34. > The package G34 has 2 numanodes connected with Hypertransport. > That is the way you can get in a single package upto 16 cores. > Hypertransport technology allows only upto 8 numanodes. > Therefore you can have mother boards that connect upto 8 C32 sockets or 4 G34 > sockets. > Hypertransport consortium has designed extensions to overcome this limit, > named high (numa) node count. > Adding below a link to the specification. > http://www.hypertransport.org/default.cfm?page=HighNodeCountSpecification > > Without the intend to flood this list with marketing BS: Not taking away from any of the cool things which you can do with this.. I suspect the OP was trying to increase his density. What you mentioned would require 2 motherboards and also an interconnect between them. I believe Cray may have a high density blade setup, but I'm not sure if there's anything like this for mere mortals. From joshua_mora at usa.net Tue Aug 12 11:10:30 2014 From: joshua_mora at usa.net (Joshua Mora) Date: Tue, 12 Aug 2014 11:10:30 -0700 Subject: [Beowulf] 8p 16 core x86_64 systems Message-ID: <558sHLsJE2480S02.1407867030@web02.cms.usa.net> Certainly my assumption/interpretation has been "what is the availability of the cheapest and largest SMP solution with full coherence in hardware that you can build". Notice I mention hardware based coherence since there are software based solutions available as well. If you need just plenty of cores at the highest core count density that you can get with small memory footprint per OS_instance/core(for instance, for consolidation/virtualization reasons), then you do not need coherence and a much wider range of solutions are available that can use other interconnects/fabrics. Joshua ------ Original Message ------ Received: 10:53 AM PDT, 08/12/2014 From: "C. Bergstr?m" To: Joshua Mora Cc: doug.lattman at L-3com.com, beowulf at beowulf.org Subject: Re: [Beowulf] 8p 16 core x86_64 systems > On 08/12/14 11:57 PM, Joshua Mora wrote: > > Hello Doug. > > AMD CPUs are connected together with a coherent fabric named Hypertransport. > > It allows to build shared memory system with coherency upto 8 numanodes. > > Currently, AMD CPUs have 1 numanode or 2 numanodes. The first is in 1 > > package/socket named C32 and the second in a 1 package/socket named G34. > > The package G34 has 2 numanodes connected with Hypertransport. > > That is the way you can get in a single package upto 16 cores. > > Hypertransport technology allows only upto 8 numanodes. 
> > Therefore you can have mother boards that connect upto 8 C32 sockets or 4 G34 > > sockets. > > Hypertransport consortium has designed extensions to overcome this limit, > > named high (numa) node count. > > Adding below a link to the specification. > > http://www.hypertransport.org/default.cfm?page=HighNodeCountSpecification > > > > Without the intend to flood this list with marketing BS: > Not taking away from any of the cool things which you can do with this.. > I suspect the OP was trying to increase his density. What you mentioned > would require 2 motherboards and also an interconnect between them. > > I believe Cray may have a high density blade setup, but I'm not sure if > there's anything like this for mere mortals. > From cbergstrom at pathscale.com Tue Aug 12 11:13:31 2014 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Wed, 13 Aug 2014 01:13:31 +0700 Subject: [Beowulf] 8p 16 core x86_64 systems In-Reply-To: <558sHLsJE2480S02.1407867030@web02.cms.usa.net> References: <558sHLsJE2480S02.1407867030@web02.cms.usa.net> Message-ID: <53EA594B.7050808@pathscale.com> On 08/13/14 01:10 AM, Joshua Mora wrote: > Certainly my assumption/interpretation has been "what is the availability of > the cheapest and largest SMP solution with full coherence in hardware that you > can build". > Notice I mention hardware based coherence since there are software based > solutions available as well. > If you need just plenty of cores at the highest core count density that you > can get with small memory footprint per OS_instance/core(for instance, for > consolidation/virtualization reasons), then you do not need coherence and a > much wider range of solutions are available that can use other > interconnects/fabrics. Is consolidation/virtualization really applicable at all to HPC? I thought virtualbox/vmware and all other friends don't allow direct access to the x86 extensions. Disabling this, the overhead or any layer on top I think would cause unacceptable levels of performance hit. (Someone please correct me if I'm wrong) The closest thing I can think which would play nice would be Solaris zones. (Which allow fine grained control of resources, but don't hide/limit the hw level capability) I'm curious to see what Doug had in mind.. From doug.lattman at L-3com.com Tue Aug 12 11:35:42 2014 From: doug.lattman at L-3com.com (doug.lattman at L-3com.com) Date: Tue, 12 Aug 2014 18:35:42 +0000 Subject: [Beowulf] 8p 16 core x86_64 systems In-Reply-To: <53EA594B.7050808@pathscale.com> References: <558sHLsJE2480S02.1407867030@web02.cms.usa.net> <53EA594B.7050808@pathscale.com> Message-ID: <604CF06B57C0DD4391CCB99DF7F36A4008A0A8@EXCHANGE.CAC.L-3com.com> I have several 100 processes I send to all the cores via mpi from a simulation. Presently, we have a fleet of computers to make up a boat load cores per computer. VMware adds a layer of abstraction, which does not get me anything. I need actual cores, running the same OS (netboot and shares via nfs). The head node sends out to all cores through the mpi interface and all the nodes are transparent to the end user. All the computers which net boot are diskless; so if I go from 4 processors to 8 processors and go from 2 u to 4u; I don't mind doubling my U to get 128 cores. At that point the energy savings, speed gain, reduced network traffic, memory efficiency increases; etc are huge. The 64 core computer which replaced an 16 core computer is huge savings in memory overhead alone. The energy savings is also huge. 
We are now powering 64cores with the same power we powered 16cores. My recollection was there was a HT connector which would allow us to bundle two motherboards to allow 8 cores to work together under one Redhat OS. Apparently, if I get the right numa chip and connectors I can get many more than 8 to tie more boards together? -----Original Message----- From: "C. Bergstr?m" [mailto:cbergstrom at pathscale.com] Sent: Tuesday, August 12, 2014 2:14 PM To: Joshua Mora Cc: Lattman, Doug @ SSG - CAC; beowulf at beowulf.org Subject: Re: [Beowulf] 8p 16 core x86_64 systems On 08/13/14 01:10 AM, Joshua Mora wrote: > Certainly my assumption/interpretation has been "what is the > availability of the cheapest and largest SMP solution with full > coherence in hardware that you can build". > Notice I mention hardware based coherence since there are software > based solutions available as well. > If you need just plenty of cores at the highest core count density > that you can get with small memory footprint per OS_instance/core(for > instance, for consolidation/virtualization reasons), then you do not > need coherence and a much wider range of solutions are available that > can use other interconnects/fabrics. Is consolidation/virtualization really applicable at all to HPC? I thought virtualbox/vmware and all other friends don't allow direct access to the x86 extensions. Disabling this, the overhead or any layer on top I think would cause unacceptable levels of performance hit. (Someone please correct me if I'm wrong) The closest thing I can think which would play nice would be Solaris zones. (Which allow fine grained control of resources, but don't hide/limit the hw level capability) I'm curious to see what Doug had in mind.. From dimitrisz at gmail.com Fri Aug 15 09:58:27 2014 From: dimitrisz at gmail.com (Dimitris Zilaskos) Date: Fri, 15 Aug 2014 19:58:27 +0300 Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: References: Message-ID: Hi, I hope your issue has been resolved meanwhile. I had a somehow similar mixed experience with Dell branded LSI controllers. It would appear that some models are just not fit for particular workloads. I have put some information in our blog at http://www.gridpp.rl.ac.uk/blog/2013/06/14/lsi-1068e-issues-understood-and-resolved/ Cheers, Dimitris On Thu, Jul 31, 2014 at 7:37 PM, mathog wrote: > Any pointers on why a system might appear to "stall" on very high IO through > an LSI megaraid adapter? (dm_raid45, on RHEL 5.10.) > > I have been working on another group's big Dell server, which has 16 CPUs, > 82 GB of memory, and 5 1TB disks which go through an LSI Megaraid (not sure > of the exact configuration and their system admin is out sick) and show up > as /dev/sda[abc], where the first two are just under 2 TB and the third is > /boot and is about 133 Gb. sda and sdb are then combined through lvm into > one big volume and that is what is mounted. > > Yesterday on this system when I ran 14 copies of this simultaneously: > > # X is 0-13 > gunzip -c bigfile${X}.gz > resultfile${X} > > the first time, part way through, all of my terminals locked up for several > minutes, and then recovered. Another similar command had the same issue > about half an hour later, but others between and since did not stall. The > size of the files unpacked is only about 0.5Gb, so even if the entire file > was stored in memory in the pipes all 14 should have fit in main memory. 
> Nothing else was running (at least that I noticed before or after, something > might have started up during the run and ended before I could look for it.) > During this period the system would still answer pings. Nothing showed up > in /var/log/messages or dmesg, "last" showed nobody else had logged in, and > overnight runs of "smartctl -t long" on the 5 disks were clean - nothing > pending, no reallocation events. > > Today ran the first set of commands again with "nice 10" and had "top" going > and nothing untoward was observed and there were no stalls. On that run > iostat showed: > > Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn > sda 6034.00 0.00 529504.00 0 529504 > sda5 6034.00 0.00 529504.00 0 529504 > dm-0 68260.00 2056.00 546008.00 2056 546008 > > > So why the apparent stalls yesterday? It felt like either my interactive > processes were swapped out or they had a much lower priority than enough > other processes so that they were not getting any CPU time. Is there some > sort of housekeeping that the Megaraid, LVM, or anything normally installed > with RHEL 5.10, might need to do, from time to time, that would account for > these stalls? > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From j.sassmannshausen at ucl.ac.uk Sat Aug 16 00:46:23 2014 From: j.sassmannshausen at ucl.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=) Date: Sat, 16 Aug 2014 08:46:23 +0100 Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: References: Message-ID: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> Hi all thanks for the thread which was very timely for me. Specially thanks Dimitris for your contribution. My problem: I got some old PCI-X LSI SCSI cards which are connected to some Infortrend storage boxes. We recently had a power-dip (lights went off and came back within 2 sec) and now the 10 year old frontend is playing up. So I need a new frontend and it seems very difficutl to get a PCI-e to PCI-X riser card so I can get a newer motherboard with more cores and more memory. Hence the thread was good for me to read as I hopefully can configure the frontend a bit better. If somebody got any comments on my problem feel free to reply. David: By the looks of it you will compress larger files on a regular base. Have you considered using the parallel version of gzip? Per default it is using all available cores but you can change that in the command line. That way you might avoid the problem with disc I/O and simply use the available cores. You also could do a 'nice' to make sure the machine does not become unresponsive due to high CPU load. Just an idea to speed up your decompressions. All the best from a sunny London J?rg On Freitag 15 August 2014 Dimitris Zilaskos wrote: > Hi, > > I hope your issue has been resolved meanwhile. I had a somehow similar > mixed experience with Dell branded LSI controllers. It would appear > that some models are just not fit for particular workloads. 
I have put > some information in our blog at > http://www.gridpp.rl.ac.uk/blog/2013/06/14/lsi-1068e-issues-understood-and- > resolved/ > > Cheers, > > Dimitris > > On Thu, Jul 31, 2014 at 7:37 PM, mathog wrote: > > Any pointers on why a system might appear to "stall" on very high IO > > through an LSI megaraid adapter? (dm_raid45, on RHEL 5.10.) > > > > I have been working on another group's big Dell server, which has 16 > > CPUs, 82 GB of memory, and 5 1TB disks which go through an LSI Megaraid > > (not sure of the exact configuration and their system admin is out sick) > > and show up as /dev/sda[abc], where the first two are just under 2 TB > > and the third is /boot and is about 133 Gb. sda and sdb are then > > combined through lvm into one big volume and that is what is mounted. > > > > Yesterday on this system when I ran 14 copies of this simultaneously: > > # X is 0-13 > > gunzip -c bigfile${X}.gz > resultfile${X} > > > > the first time, part way through, all of my terminals locked up for > > several minutes, and then recovered. Another similar command had the > > same issue about half an hour later, but others between and since did > > not stall. The size of the files unpacked is only about 0.5Gb, so even > > if the entire file was stored in memory in the pipes all 14 should have > > fit in main memory. Nothing else was running (at least that I noticed > > before or after, something might have started up during the run and > > ended before I could look for it.) During this period the system would > > still answer pings. Nothing showed up in /var/log/messages or dmesg, > > "last" showed nobody else had logged in, and overnight runs of "smartctl > > -t long" on the 5 disks were clean - nothing pending, no reallocation > > events. > > > > Today ran the first set of commands again with "nice 10" and had "top" > > going and nothing untoward was observed and there were no stalls. On > > that run iostat showed: > > > > Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn > > sda 6034.00 0.00 529504.00 0 529504 > > sda5 6034.00 0.00 529504.00 0 529504 > > dm-0 68260.00 2056.00 546008.00 2056 546008 > > > > > > So why the apparent stalls yesterday? It felt like either my interactive > > processes were swapped out or they had a much lower priority than enough > > other processes so that they were not getting any CPU time. Is there some > > sort of housekeeping that the Megaraid, LVM, or anything normally > > installed with RHEL 5.10, might need to do, from time to time, that > > would account for these stalls? > > > > Thanks, > > > > David Mathog > > mathog at caltech.edu > > Manager, Sequence Analysis Facility, Biology Division, Caltech > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- ************************************************************* Dr. J?rg Sa?mannshausen, MRSC University College London Department of Chemistry Gordon Street London WC1H 0AJ email: j.sassmannshausen at ucl.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. 
See http://www.gnu.org/philosophy/no-word-attachments.html From greg.matthews at diamond.ac.uk Mon Aug 18 05:19:29 2014 From: greg.matthews at diamond.ac.uk (Gregory Matthews) Date: Mon, 18 Aug 2014 13:19:29 +0100 Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> Message-ID: <53F1EF51.2050206@diamond.ac.uk> On 16/08/14 08:46, J?rg Sa?mannshausen wrote: > My problem: I got some old PCI-X LSI SCSI cards which are connected to some > Infortrend storage boxes. We recently had a power-dip (lights went off and came > back within 2 sec) and now the 10 year old frontend is playing up. So I need a > new frontend and it seems very difficutl to get a PCI-e to PCI-X riser card so I > can get a newer motherboard with more cores and more memory. good luck with that! Those technologies are pretty incompatible. There are one or two PCIe (x1) to PCI (maybe compatible with PCI-X - check voltages etc.) converters but I wouldn't trust them with my storage. The last server we bought that was still compatible with PCI-X was a Dell Poweredge R200, you needed to specify PCI-X riser when buying. Maybe ebay is your best bet at this point? GREG > > Hence the thread was good for me to read as I hopefully can configure the > frontend a bit better. > > If somebody got any comments on my problem feel free to reply. > > David: By the looks of it you will compress larger files on a regular base. > Have you considered using the parallel version of gzip? Per default it is > using all available cores but you can change that in the command line. That > way you might avoid the problem with disc I/O and simply use the available > cores. You also could do a 'nice' to make sure the machine does not become > unresponsive due to high CPU load. Just an idea to speed up your > decompressions. > > All the best from a sunny London > > J?rg > > > On Freitag 15 August 2014 Dimitris Zilaskos wrote: >> Hi, >> >> I hope your issue has been resolved meanwhile. I had a somehow similar >> mixed experience with Dell branded LSI controllers. It would appear >> that some models are just not fit for particular workloads. I have put >> some information in our blog at >> http://www.gridpp.rl.ac.uk/blog/2013/06/14/lsi-1068e-issues-understood-and- >> resolved/ >> >> Cheers, >> >> Dimitris >> >> On Thu, Jul 31, 2014 at 7:37 PM, mathog wrote: >>> Any pointers on why a system might appear to "stall" on very high IO >>> through an LSI megaraid adapter? (dm_raid45, on RHEL 5.10.) >>> >>> I have been working on another group's big Dell server, which has 16 >>> CPUs, 82 GB of memory, and 5 1TB disks which go through an LSI Megaraid >>> (not sure of the exact configuration and their system admin is out sick) >>> and show up as /dev/sda[abc], where the first two are just under 2 TB >>> and the third is /boot and is about 133 Gb. sda and sdb are then >>> combined through lvm into one big volume and that is what is mounted. >>> >>> Yesterday on this system when I ran 14 copies of this simultaneously: >>> # X is 0-13 >>> gunzip -c bigfile${X}.gz > resultfile${X} >>> >>> the first time, part way through, all of my terminals locked up for >>> several minutes, and then recovered. Another similar command had the >>> same issue about half an hour later, but others between and since did >>> not stall. 
The size of the files unpacked is only about 0.5Gb, so even >>> if the entire file was stored in memory in the pipes all 14 should have >>> fit in main memory. Nothing else was running (at least that I noticed >>> before or after, something might have started up during the run and >>> ended before I could look for it.) During this period the system would >>> still answer pings. Nothing showed up in /var/log/messages or dmesg, >>> "last" showed nobody else had logged in, and overnight runs of "smartctl >>> -t long" on the 5 disks were clean - nothing pending, no reallocation >>> events. >>> >>> Today ran the first set of commands again with "nice 10" and had "top" >>> going and nothing untoward was observed and there were no stalls. On >>> that run iostat showed: >>> >>> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn >>> sda 6034.00 0.00 529504.00 0 529504 >>> sda5 6034.00 0.00 529504.00 0 529504 >>> dm-0 68260.00 2056.00 546008.00 2056 546008 >>> >>> >>> So why the apparent stalls yesterday? It felt like either my interactive >>> processes were swapped out or they had a much lower priority than enough >>> other processes so that they were not getting any CPU time. Is there some >>> sort of housekeeping that the Megaraid, LVM, or anything normally >>> installed with RHEL 5.10, might need to do, from time to time, that >>> would account for these stalls? >>> >>> Thanks, >>> >>> David Mathog >>> mathog at caltech.edu >>> Manager, Sequence Analysis Facility, Biology Division, Caltech >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > -- Greg Matthews 01235 778658 Scientific Computing Group Leader Diamond Light Source Ltd. OXON UK -- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom From deadline at eadline.org Tue Aug 19 07:10:25 2014 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 19 Aug 2014 10:10:25 -0400 Subject: [Beowulf] Docker vs KVM paper by IBM Message-ID: I ran across this interesting paper by IBM: An Updated Performance Comparison of Virtual Machines and Linux Containers Nothing like hard numbers. 
I created a short article with links on Cluster Monkey: http://clustermonkey.net/Select-News/docker-versus-kvm-hard-numbers-for-hpc.html -- Doug From kilian.cavalotti.work at gmail.com Tue Aug 19 09:16:34 2014 From: kilian.cavalotti.work at gmail.com (Kilian Cavalotti) Date: Tue, 19 Aug 2014 09:16:34 -0700 Subject: [Beowulf] Docker vs KVM paper by IBM In-Reply-To: References: Message-ID: Hi all, On Tue, Aug 19, 2014 at 7:10 AM, Douglas Eadline wrote: > I ran across this interesting paper by IBM: > An Updated Performance Comparison of Virtual Machines and Linux Containers It's an interesting paper, but I kind of feel it's comparing apples to oranges. They're both round and tasty, but it's not really the same thing. There's probably no need to detail this, but KVM is a virtualization infrastructure that runs full-stack OSes (using their own kernels) on top of a Linux kernel turned into a hypervisor. So yes, it carries the overhead of running a kernel over a kernel, but also the flexibility of doing so (i.e. you can run different kernel/OS versions on top of each other, use virtual devices and so on). Docker, on the other hand, is a containerization infrastructure that runs processes on top of an existing, regular kernel. Not to diminish its merits, which are great in many areas, but it's closer to a kind of glorified chroot. So, it's no surprise that Docker performance would be the same as the underlying OS's, while KVM overhead is much larger. There's a full layer of virtualization difference between the two. And they also ran a single VM or container per host. It would probably also be interesting to see what happens when you run multiple VMs or multiple containers on the same host. I guess it's nice somebody took the time to do the test, to ensure that Docker management or the LXC infrastructure was not impacting the containers' performance too much, but I'm not sure I really understand the goal of the paper. Worst case, it will probably be misleading for people who will end up comparing two different tools with very different purposes and use cases. "What do you mean I can not upgrade the kernel in my container?" Cheers, -- Kilian From j.sassmannshausen at ucl.ac.uk Tue Aug 19 15:08:16 2014 From: j.sassmannshausen at ucl.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=) Date: Tue, 19 Aug 2014 23:08:16 +0100 Subject: [Beowulf] LSI Megaraid stalls system on very high IO? In-Reply-To: <53F1EF51.2050206@diamond.ac.uk> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> Message-ID: <201408192308.18212.j.sassmannshausen@ucl.ac.uk> Hi Greg, thanks for the email. I agree, I will be lucky to get such a machine. What I probably will do is go for a modern motherboard and try and get a PCI-e SCSI card. I hope at least they exist.... All the best from a cold London Jörg On Montag 18 August 2014 Gregory Matthews wrote: > On 16/08/14 08:46, Jörg Saßmannshausen wrote: > > My problem: I got some old PCI-X LSI SCSI cards which are connected to > > some Infortrend storage boxes. We recently had a power-dip (lights went > > off and came back within 2 sec) and now the 10 year old frontend is > > playing up. So I need a new frontend and it seems very difficult to get > > a PCI-e to PCI-X riser card so I can get a newer motherboard with more > > cores and more memory. > > good luck with that! Those technologies are pretty incompatible. There > are one or two PCIe (x1) to PCI (maybe compatible with PCI-X - check > voltages etc.)
converters but I wouldn't trust them with my storage. > > The last server we bought that was still compatible with PCI-X was a > Dell Poweredge R200, you needed to specify PCI-X riser when buying. > Maybe ebay is your best bet at this point? > > GREG > > > Hence the thread was good for me to read as I hopefully can configure the > > frontend a bit better. > > > > If somebody got any comments on my problem feel free to reply. > > > > David: By the looks of it you will compress larger files on a regular > > base. Have you considered using the parallel version of gzip? Per > > default it is using all available cores but you can change that in the > > command line. That way you might avoid the problem with disc I/O and > > simply use the available cores. You also could do a 'nice' to make sure > > the machine does not become unresponsive due to high CPU load. Just an > > idea to speed up your decompressions. > > > > All the best from a sunny London > > > > J?rg > > > > On Freitag 15 August 2014 Dimitris Zilaskos wrote: > >> Hi, > >> > >> I hope your issue has been resolved meanwhile. I had a somehow similar > >> mixed experience with Dell branded LSI controllers. It would appear > >> that some models are just not fit for particular workloads. I have put > >> some information in our blog at > >> http://www.gridpp.rl.ac.uk/blog/2013/06/14/lsi-1068e-issues-understood-a > >> nd- resolved/ > >> > >> Cheers, > >> > >> Dimitris > >> > >> On Thu, Jul 31, 2014 at 7:37 PM, mathog wrote: > >>> Any pointers on why a system might appear to "stall" on very high IO > >>> through an LSI megaraid adapter? (dm_raid45, on RHEL 5.10.) > >>> > >>> I have been working on another group's big Dell server, which has 16 > >>> CPUs, 82 GB of memory, and 5 1TB disks which go through an LSI Megaraid > >>> (not sure of the exact configuration and their system admin is out > >>> sick) and show up as /dev/sda[abc], where the first two are just under > >>> 2 TB and the third is /boot and is about 133 Gb. sda and sdb are then > >>> combined through lvm into one big volume and that is what is mounted. > >>> > >>> Yesterday on this system when I ran 14 copies of this simultaneously: > >>> # X is 0-13 > >>> gunzip -c bigfile${X}.gz > resultfile${X} > >>> > >>> the first time, part way through, all of my terminals locked up for > >>> several minutes, and then recovered. Another similar command had the > >>> same issue about half an hour later, but others between and since did > >>> not stall. The size of the files unpacked is only about 0.5Gb, so even > >>> if the entire file was stored in memory in the pipes all 14 should have > >>> fit in main memory. Nothing else was running (at least that I noticed > >>> before or after, something might have started up during the run and > >>> ended before I could look for it.) During this period the system would > >>> still answer pings. Nothing showed up in /var/log/messages or dmesg, > >>> "last" showed nobody else had logged in, and overnight runs of > >>> "smartctl -t long" on the 5 disks were clean - nothing pending, no > >>> reallocation events. > >>> > >>> Today ran the first set of commands again with "nice 10" and had "top" > >>> going and nothing untoward was observed and there were no stalls. On > >>> that run iostat showed: > >>> > >>> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn > >>> sda 6034.00 0.00 529504.00 0 529504 > >>> sda5 6034.00 0.00 529504.00 0 529504 > >>> dm-0 68260.00 2056.00 546008.00 2056 546008 > >>> > >>> > >>> So why the apparent stalls yesterday? 
It felt like either my > >>> interactive processes were swapped out or they had a much lower > >>> priority than enough other processes so that they were not getting any > >>> CPU time. Is there some sort of housekeeping that the Megaraid, LVM, > >>> or anything normally installed with RHEL 5.10, might need to do, from > >>> time to time, that would account for these stalls? > >>> > >>> Thanks, > >>> > >>> David Mathog > >>> mathog at caltech.edu > >>> Manager, Sequence Analysis Facility, Biology Division, Caltech > >>> _______________________________________________ > >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > >>> Computing To change your subscription (digest mode or unsubscribe) > >>> visit http://www.beowulf.org/mailman/listinfo/beowulf > >> > >> _______________________________________________ > >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > >> To change your subscription (digest mode or unsubscribe) visit > >> http://www.beowulf.org/mailman/listinfo/beowulf -- ************************************************************* Dr. J?rg Sa?mannshausen, MRSC University College London Department of Chemistry Gordon Street London WC1H 0AJ email: j.sassmannshausen at ucl.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html From bill at cse.ucdavis.edu Wed Aug 27 19:29:00 2014 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 27 Aug 2014 19:29:00 -0700 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <201408192308.18212.j.sassmannshausen@ucl.ac.uk> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> Message-ID: <53FE93EC.3080000@cse.ucdavis.edu> The URL: http://energy.gov/seab/downloads/draft-report-task-force-high-performance-computing One piece I found particularly interesting: There has been very little open source that has made its way into broad use within the HPC commercial community where great emphasis is placed on serviceability and security. There is a better track record in data analytics recently with map/reduce as a notable example. This is less of an issue for universities or national laboratories but they represent no more than about 10%-15% of all HPC usage. Of course, one cannot ?force? the adoption of open source but one should also not plan on it being a panacea to any ecosystem shortcoming. A focus investment effort within universities could expand the volume of open source and increase the chances that some of the software output could become commercialized. It should be noted that the most significant consumption of open source software is China and it is also the case that the Chinese are rare contributors to open source as well. Investments in open source or other policy actions to stimulate creation are likely to produce a disproportionate benefit accruing to the Chinese. From bug at wharton.upenn.edu Thu Aug 28 05:26:44 2014 From: bug at wharton.upenn.edu (Gavin W. 
Burris) Date: Thu, 28 Aug 2014 08:26:44 -0400 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <53FE93EC.3080000@cse.ucdavis.edu> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> Message-ID: <20140828122644.GC3780@shadow> Hi, Bill. This is perplexing... So, the Linux kernel and supporting tools that make the operating system aren't being factored in here? The compiler? The libraries? If "very little open source" has "made its way into broad use within HPC," what OS are the majority running if not Linux? This seem to be greatly uninformed, or pushing an agenda. The only way I can see this excerpt as even remotely true would be if you applied a very narrow survey to a specific application set. But that narrow view does not apply to a full operational stack or all of HPC in general! I'm baffled, because this does not jive with my lay of the land. Cheers. On 07:29PM Wed 08/27/14 -0700, Bill Broadley wrote: > > The URL: > http://energy.gov/seab/downloads/draft-report-task-force-high-performance-computing > > One piece I found particularly interesting: > > There has been very little open source that has made its way into broad use > within the HPC commercial community where great emphasis is placed on > serviceability and security. There is a better track record in data analytics > recently with map/reduce as a notable example. This is less of an issue for > universities or national laboratories but they represent no more than about > 10%-15% of all HPC usage. Of course, one cannot ?force? the adoption of open > source but one should also not plan on it being a panacea to any ecosystem > shortcoming. A focus investment effort within universities could expand the > volume of open source and increase the chances that some of the software > output could become commercialized. It should be noted that the most > significant consumption of open source software is China and it is also the > case that the Chinese are rare contributors to open source as well. > Investments in open source or other policy actions to stimulate creation are > likely to produce a disproportionate benefit accruing to the Chinese. > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gavin W. Burris Senior Project Leader for Research Computing The Wharton School University of Pennsylvania From dmitri.chubarov at gmail.com Thu Aug 28 05:46:16 2014 From: dmitri.chubarov at gmail.com (Dmitri Chubarov) Date: Thu, 28 Aug 2014 19:46:16 +0700 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <20140828122644.GC3780@shadow> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> Message-ID: Hi, Gavin, It seems to be an inevitable conclusion that in the view of the authors of the Report Beowulf is not HPC. All the best On Thu, Aug 28, 2014 at 7:26 PM, Gavin W. Burris wrote: > Hi, Bill. > > This is perplexing... > > So, the Linux kernel and supporting tools that make the operating system > aren't > being factored in here? The compiler? 
The libraries? If "very little > open > source" has "made its way into broad use within HPC," what OS are the > majority > running if not Linux? This seem to be greatly uninformed, or pushing an > agenda. The only way I can see this excerpt as even remotely true would > be if > you applied a very narrow survey to a specific application set. But that > narrow view does not apply to a full operational stack or all of HPC in > general! I'm baffled, because this does not jive with my lay of the land. > > Cheers. > > On 07:29PM Wed 08/27/14 -0700, Bill Broadley wrote: > > > > The URL: > > > http://energy.gov/seab/downloads/draft-report-task-force-high-performance-computing > > > > One piece I found particularly interesting: > > > > There has been very little open source that has made its way into > broad use > > within the HPC commercial community where great emphasis is placed on > > serviceability and security. There is a better track record in data > analytics > > recently with map/reduce as a notable example. This is less of an > issue for > > universities or national laboratories but they represent no more than > about > > 10%-15% of all HPC usage. Of course, one cannot ?force? the adoption > of open > > source but one should also not plan on it being a panacea to any > ecosystem > > shortcoming. A focus investment effort within universities could > expand the > > volume of open source and increase the chances that some of the > software > > output could become commercialized. It should be noted that the most > > significant consumption of open source software is China and it is > also the > > case that the Chinese are rare contributors to open source as well. > > Investments in open source or other policy actions to stimulate > creation are > > likely to produce a disproportionate benefit accruing to the Chinese. > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > Gavin W. Burris > Senior Project Leader for Research Computing > The Wharton School > University of Pennsylvania > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ellis at cse.psu.edu Thu Aug 28 05:46:21 2014 From: ellis at cse.psu.edu (Ellis H. Wilson III) Date: Thu, 28 Aug 2014 08:46:21 -0400 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <20140828122644.GC3780@shadow> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> Message-ID: <53FF249D.6080101@cse.psu.edu> On 08/28/2014 08:26 AM, Gavin W. Burris wrote: > So, the Linux kernel and supporting tools that make the operating system aren't > being factored in here? The compiler? The libraries? If "very little open > source" has "made its way into broad use within HPC," what OS are the majority > running if not Linux? This seem to be greatly uninformed, or pushing an > agenda. 
The only way I can see this excerpt as even remotely true would be if > you applied a very narrow survey to a specific application set. But that > narrow view does not apply to a full operational stack or all of HPC in > general! I'm baffled, because this does not jive with my lay of the land. Couldn't agree more. An incredibly boring read to boot. "Dr. Moniz requested that the Task Force look at the problems and opportunities that will drive the need for next generation high performance computing (HPC), what will be required to execute a successful path to deliver next generation leading edge HPC..." Apparently means, "Please deliver a report with as many hype words as possible, with little connection to how HPC is actually done, and conclude with how bielleons of dollars need invested to make all of this hype possible! And in looking way into the future, don't lose sight of our good friends in Oil & Gas. They'll need at least a page about them! Photovoltaics? Eh. A paragraph should suffice." Reads like a grant to me! ellis From cbergstrom at pathscale.com Thu Aug 28 05:49:49 2014 From: cbergstrom at pathscale.com (=?UTF-8?B?IkMuIEJlcmdzdHLDtm0i?=) Date: Thu, 28 Aug 2014 19:49:49 +0700 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <20140828122644.GC3780@shadow> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> Message-ID: <53FF256D.5050208@pathscale.com> On 08/28/14 07:26 PM, Gavin W. Burris wrote: > Hi, Bill. > > This is perplexing... > > So, the Linux kernel and supporting tools that make the operating system aren't > being factored in here? The compiler? The libraries? If "very little open > source" has "made its way into broad use within HPC," what OS are the majority > running if not Linux? This seem to be greatly uninformed, or pushing an > agenda. The only way I can see this excerpt as even remotely true would be if > you applied a very narrow survey to a specific application set. But that > narrow view does not apply to a full operational stack or all of HPC in > general! I'm baffled, because this does not jive with my lay of the land. baffled you say? Let's go down the list of things you mentioned supporting tools - Allinea/Totalview and the various performance analysis tools - are they open source? (partially maybe, but not completely) compiler - I'll refrain from selfish self advertising, but with the exception of gcc - anything else I've seen on a cluster and when people care about performance - they likely use something which is closed source. (I don't know many Gordon Bell winners using gcc.) libs - Off the top of my head.. MKL, NAG, cuBLAS and a few others aren't open source. (Ok LAPACK, OpenBLAS and a few others are open source... got me here) MPI - From what I've seen the larger OEM and system integrators end up effectively creating a closed source version from one of the open source things. The linux kernel is open source, but what about the highly modified compute node OS which are common? I doubt there's a single customer who has requested the source and published those modifications.. Good schedulers? --------- We can go deeper into domain specific stuff and then it's really a mixed bag for what's open and what's not... The original post was probably clickbait and irrelevant anyway. Who cares?
./C From bug at wharton.upenn.edu Thu Aug 28 06:31:58 2014 From: bug at wharton.upenn.edu (Gavin W. Burris) Date: Thu, 28 Aug 2014 09:31:58 -0400 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <53FF256D.5050208@pathscale.com> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> <53FF256D.5050208@pathscale.com> Message-ID: <20140828133158.GD3780@shadow> Hi, C. Yes, there are many closed source, domain-specific, proprietary tools, and I like them! Any full HPC stack has many open source pieces as functional components. One cannot pick what does or does not count. To this reports defense, it does read as if to reference the near future, a future where we throw out all existing code and start from scratch. The report seems to be making a sci-fi-esque call for a complete rethink and reinvention of billions-and-billions of lines of code. Very admirable. Cheers. On 07:49PM Thu 08/28/14 +0700, "C. Bergstr?m" wrote: > On 08/28/14 07:26 PM, Gavin W. Burris wrote: > >Hi, Bill. > > > >This is perplexing... > > > >So, the Linux kernel and supporting tools that make the operating system aren't > >being factored in here? The compiler? The libraries? If "very little open > >source" has "made its way into broad use within HPC," what OS are the majority > >running if not Linux? This seem to be greatly uninformed, or pushing an > >agenda. The only way I can see this excerpt as even remotely true would be if > >you applied a very narrow survey to a specific application set. But that > >narrow view does not apply to a full operational stack or all of HPC in > >general! I'm baffled, because this does not jive with my lay of the land. > baffled you say? > > Lets go down the list of things you mentioned > > supporting tools - Allinea/Totalview and the various performance analysis > tools - are they open source? (partially maybe, but not completely) > > compiler - I'll refrain from selfish self advertising, but with the > exception of gcc - anything else I've seen on a cluster and when people care > about performance - they likely use something which is closed source. (I > don't know many Gordn Bell winners using gcc.) > > libs - Off the top of my head.. MKL, NAG, cuBLAS and a few others aren't > open source. (Ok LAPCK, OpenBLAS and a few others are open source... got me > here) > > MPI - From what I've seen the larger OEM and system integrators end up > effectively creating a closed source version from one of the open source > things. > > The linux kernel is open source, but what about the highly modified compute > node OS which are common? I doubt there's a single customer who has > requested the source and published those modifications.. > > Good schedulers? > --------- > We can go deeper into domain specific stuff and then it's really mixed bag > for what's open and what's not... > > The original post was probably clickbait and irrelevant anyway. Who cares? > > ./C > -- Gavin W. Burris Senior Project Leader for Research Computing The Wharton School University of Pennsylvania From bug at wharton.upenn.edu Thu Aug 28 06:34:17 2014 From: bug at wharton.upenn.edu (Gavin W. 
Burris) Date: Thu, 28 Aug 2014 09:34:17 -0400 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <53FF249D.6080101@cse.psu.edu> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> <53FF249D.6080101@cse.psu.edu> Message-ID: <20140828133417.GE3780@shadow> On 08:46AM Thu 08/28/14 -0400, Ellis H. Wilson III wrote: > Reads like a grant to me! > Snort my coffee there. Hilarious. -- Gavin W. Burris Senior Project Leader for Research Computing The Wharton School University of Pennsylvania From atchley at tds.net Thu Aug 28 06:45:34 2014 From: atchley at tds.net (atchley tds.net) Date: Thu, 28 Aug 2014 09:45:34 -0400 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <20140828133158.GD3780@shadow> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> <53FF256D.5050208@pathscale.com> <20140828133158.GD3780@shadow> Message-ID: On Thu, Aug 28, 2014 at 9:31 AM, Gavin W. Burris wrote: > Hi, C. > > Yes, there are many closed source, domain-specific, proprietary tools, and > I > like them! Any full HPC stack has many open source pieces as functional > components. One cannot pick what does or does not count. > > To this reports defense, it does read as if to reference the near future, a > future where we throw out all existing code and start from scratch. The > report > seems to be making a sci-fi-esque call for a complete rethink and > reinvention > of billions-and-billions of lines of code. Very admirable. > > Cheers. I do not read it as a call to throw out everything. DOE plans to support MPI+X (where X can be OpenMP or OpenACC or something else) for the foreseeable future. Asking scientists to rewrite from scratch is a tall order. The problem with MPI+X now is that we are tuning for specific machines at specific sites. It is no longer portable solution by itself (i.e. write it once and run everywhere and expect the best performance). Some prefer a PGAS model and are looking at it as a hedge against MPI scaling issues for very large system. The thinking is, that with millions of nodes, that MPI rank lookup information becomes to consume more memory than we want. Still others are proposing event-driven task models (EDTs) such as Legion and the Open Community Runtime. These promise the ability to write once and let the compiler and runtime extract the best performance from the given hardware. The downside is the complete rewrite and the difficulty to diagnose poor performance (i.e. did I do a bad job or was it the compiler's or runtime's fault). To DOE's credit, they are not picking and choosing. They are funding R&D such as Fast Forward and Design Forward as well as software development. They _do_ want to see the investments pay off and be used. Scott -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From prentice.bisbal at rutgers.edu Thu Aug 28 06:54:43 2014 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Thu, 28 Aug 2014 09:54:43 -0400 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <53FF3392.2030803@rutgers.edu> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> <53FF3392.2030803@rutgers.edu> Message-ID: <53FF34A3.7050907@rutgers.edu> Disclaimer: I didn't read the full report. I was only responding to the quote included in the original e-mail. It sounds like Gavin was responding to the whole report, based on subsequent posts. I'll read the whole report and then post my own obligatory rant. Prentice On 08/28/2014 09:50 AM, Prentice Bisbal wrote: > Gavin, > > You didn't read the full sentence. The keyword is 'commercial' (I > added the emphasis): > >> There has been very little open source that has made its way into >> broad use >> within the HPC COMMERCIAL community where great emphasis is placed on >> serviceability and security > > This shouldn't be news to most of us. In the commercial world, it > seems a lot of managers want to pay for commercial software so they > can call/blame/sue someone when something goes wrong with the > software. This is why Red Hat Enterprise Linux exists. > > Prentice Bisbal > Manager of Information Technology > Rutgers Discovery Informatics Institute (RDI2) > Rutgers University > http://rdi2.rutgers.edu > > On 08/28/2014 08:26 AM, Gavin W. Burris wrote: >> Hi, Bill. >> >> This is perplexing... >> >> So, the Linux kernel and supporting tools that make the operating >> system aren't >> being factored in here? The compiler? The libraries? If "very >> little open >> source" has "made its way into broad use within HPC," what OS are the >> majority >> running if not Linux? This seem to be greatly uninformed, or pushing an >> agenda. The only way I can see this excerpt as even remotely true >> would be if >> you applied a very narrow survey to a specific application set. But that >> narrow view does not apply to a full operational stack or all of HPC in >> general! I'm baffled, because this does not jive with my lay of the >> land. >> >> Cheers. >> >> On 07:29PM Wed 08/27/14 -0700, Bill Broadley wrote: >>> The URL: >>> http://energy.gov/seab/downloads/draft-report-task-force-high-performance-computing >>> >>> >>> One piece I found particularly interesting: >>> >>> There has been very little open source that has made its way into >>> broad use >>> within the HPC commercial community where great emphasis is >>> placed on >>> serviceability and security. There is a better track record in >>> data analytics >>> recently with map/reduce as a notable example. This is less of an >>> issue for >>> universities or national laboratories but they represent no more >>> than about >>> 10%-15% of all HPC usage. Of course, one cannot ?force? the >>> adoption of open >>> source but one should also not plan on it being a panacea to any >>> ecosystem >>> shortcoming. A focus investment effort within universities could >>> expand the >>> volume of open source and increase the chances that some of the >>> software >>> output could become commercialized. It should be noted that the most >>> significant consumption of open source software is China and it >>> is also the >>> case that the Chinese are rare contributors to open source as well. 
>>> Investments in open source or other policy actions to stimulate >>> creation are >>> likely to produce a disproportionate benefit accruing to the >>> Chinese. >>> >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>> Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf > From prentice.bisbal at rutgers.edu Thu Aug 28 06:50:10 2014 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Thu, 28 Aug 2014 09:50:10 -0400 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <20140828122644.GC3780@shadow> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> Message-ID: <53FF3392.2030803@rutgers.edu> Gavin, You didn't read the full sentence. The keyword is 'commercial' (I added the emphasis): > There has been very little open source that has made its way into broad use > within the HPC COMMERCIAL community where great emphasis is placed on > serviceability and security This shouldn't be news to most of us. In the commercial world, it seems a lot of managers want to pay for commercial software so they can call/blame/sue someone when something goes wrong with the software. This is why Red Hat Enterprise Linux exists. Prentice Bisbal Manager of Information Technology Rutgers Discovery Informatics Institute (RDI2) Rutgers University http://rdi2.rutgers.edu On 08/28/2014 08:26 AM, Gavin W. Burris wrote: > Hi, Bill. > > This is perplexing... > > So, the Linux kernel and supporting tools that make the operating system aren't > being factored in here? The compiler? The libraries? If "very little open > source" has "made its way into broad use within HPC," what OS are the majority > running if not Linux? This seem to be greatly uninformed, or pushing an > agenda. The only way I can see this excerpt as even remotely true would be if > you applied a very narrow survey to a specific application set. But that > narrow view does not apply to a full operational stack or all of HPC in > general! I'm baffled, because this does not jive with my lay of the land. > > Cheers. > > On 07:29PM Wed 08/27/14 -0700, Bill Broadley wrote: >> The URL: >> http://energy.gov/seab/downloads/draft-report-task-force-high-performance-computing >> >> One piece I found particularly interesting: >> >> There has been very little open source that has made its way into broad use >> within the HPC commercial community where great emphasis is placed on >> serviceability and security. There is a better track record in data analytics >> recently with map/reduce as a notable example. This is less of an issue for >> universities or national laboratories but they represent no more than about >> 10%-15% of all HPC usage. Of course, one cannot ?force? the adoption of open >> source but one should also not plan on it being a panacea to any ecosystem >> shortcoming. A focus investment effort within universities could expand the >> volume of open source and increase the chances that some of the software >> output could become commercialized. It should be noted that the most >> significant consumption of open source software is China and it is also the >> case that the Chinese are rare contributors to open source as well. 
>> Investments in open source or other policy actions to stimulate creation are >> likely to produce a disproportionate benefit accruing to the Chinese. >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From joshua_mora at usa.net Thu Aug 28 07:17:28 2014 From: joshua_mora at usa.net (Joshua Mora) Date: Thu, 28 Aug 2014 07:17:28 -0700 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing Message-ID: <987sHboqC2752S06.1409235448@web06.cms.usa.net> The codesign effort pushed by the new requirements/constraints (power and performance) is shaking design decision of existing SW frameworks, hence forcing to get rewritten overtime to add new fundamental functionality (eg. progress threads for asynchronous communication and fault tolerance). The report, as a holistic view of the challenge, suffers from lack of detail although references are provided for some of their statements. It is obvious that such effort will require multi year investment in a wide range of disciplines. It seems a reasonable amount based on the investments within the HW industry that will seek ROI. I personally think this cannot be anymore a country-only solely lead effort and in terms of budget to develop such program, China could be already better set. Keeping Intellectual Property is going to be difficult under the pressure to deliver within timelines. Joshua. > To this reports defense, it does read as if to reference the near future, a > future where we throw out all existing code and start from scratch. The report > seems to be making a sci-fi-esque call for a complete rethink and reinvention > of billions-and-billions of lines of code. Very admirable. From prentice.bisbal at rutgers.edu Thu Aug 28 08:22:02 2014 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Thu, 28 Aug 2014 11:22:02 -0400 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <987sHboqC2752S06.1409235448@web06.cms.usa.net> References: <987sHboqC2752S06.1409235448@web06.cms.usa.net> Message-ID: <53FF491A.5030200@rutgers.edu> On 08/28/2014 10:17 AM, Joshua Mora wrote: > The codesign effort pushed by the new requirements/constraints (power and > performance) is shaking design decision of existing SW frameworks, hence > forcing to get rewritten overtime to add new fundamental functionality (eg. > progress threads for asynchronous communication and fault tolerance). > > The report, as a holistic view of the challenge, suffers from lack of detail > although references are provided for some of their statements. It is obvious > that such effort will require multi year investment in a wide range of > disciplines. It seems a reasonable amount based on the investments within the > HW industry that will seek ROI. > > I personally think this cannot be anymore a country-only solely lead effort > and in terms of budget to develop such program, China could be already better > set. Keeping Intellectual Property is going to be difficult under the pressure > to deliver within timelines. > China clearly has the budget and motivation for this sort of work, but the software skills in China are still considerably behind the rest of the HPC superpowers. 
This is something China readily admits, and is working to address.* At the pace they're moving, though, I imagine it won't be long before this is fixed, but a cultural change like that would still probably take some time, I'd say 10 or more years. * My source for this statement is a seminar Bill Tang from Princeton University gave a few years ago titled "Perspectives on China's Role in Global High Performance Computing" (see description at http://www.princeton.edu/researchcomputing/education/colloquia/). -- Prentice From deadline at eadline.org Thu Aug 28 08:57:16 2014 From: deadline at eadline.org (Douglas Eadline) Date: Thu, 28 Aug 2014 11:57:16 -0400 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <53FE93EC.3080000@cse.ucdavis.edu> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> Message-ID: <0de208e350285f45bc07fd4f05626c05.squirrel@mail.eadline.org> What the hell, I'll bite, and I did not read the full report. "There has been very little open source that has made its way into broad use within the HPC commercial community where great emphasis is placed on serviceability and security." This sentence seems to imply that serviceability and security are holding back open source in HPC while ignoring the general uptake of HPC by industry. While there is merit to this claim, there are many other "hold backs" to commercial uptake for "Blue Collar HPC". It remains to be seen whether, if these other hold backs are addressed, open source solutions will see a greater uptake, thus creating a real economic incentive for such features. "There is a better track record in data analytics recently with map/reduce as a notable example." Well sure, a much, much bigger market changes the economic equation. -- Doug > > The URL: > http://energy.gov/seab/downloads/draft-report-task-force-high-performance-computing > > One piece I found particularly interesting: > > There has been very little open source that has made its way into broad > use > within the HPC commercial community where great emphasis is placed on > serviceability and security. There is a better track record in data > analytics > recently with map/reduce as a notable example. This is less of an issue > for > universities or national laboratories but they represent no more than > about > 10%-15% of all HPC usage. Of course, one cannot "force" the adoption of > open > source but one should also not plan on it being a panacea to any > ecosystem > shortcoming. A focus investment effort within universities could expand > the > volume of open source and increase the chances that some of the > software > output could become commercialized. It should be noted that the most > significant consumption of open source software is China and it is also > the > case that the Chinese are rare contributors to open source as well.
> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > Mailscanner: Clean > > -- Doug -- Mailscanner: Clean From joshua_mora at usa.net Thu Aug 28 09:19:15 2014 From: joshua_mora at usa.net (Joshua Mora) Date: Thu, 28 Aug 2014 09:19:15 -0700 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing Message-ID: <967sHbqsP1456S05.1409242755@web05.cms.usa.net> > This is something China readily admits, and is > working to address.* At the pace their, moving though, I imagine it > won't be long before this is fixed, but a cultural change like that > would still probably take some time, I'd say 10 or more years. I read on an interview to a scientist on another discipline (bio science) that China is starting to offer to leading scientists in their fields to bring them over and put in their hands a fully equipped lab (including staff of +50) to speed up their R&D. So depending on urgency, I do not think there is a need for cultural change. It seems to me like a shift on world economics in 21st century will produce a wave of scientist to move to China, very much in the same way it happened in US in 20th century. Maybe that personal forward looking statement is sci-fi, we'll see. Joshua From james.p.lux at jpl.nasa.gov Thu Aug 28 13:26:24 2014 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 28 Aug 2014 20:26:24 +0000 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> <53FF256D.5050208@pathscale.com> <20140828133158.GD3780@shadow> Message-ID: (sorry about top post.. Outlook.. need I say more) This time of year (end of US Govt Fiscal Year) is a time for reports and such. As Scott points out, program managers are always under the gun to show that R&D investments that they have made are ?transitioning to operational use? or that the investment was sufficiently worthy. We leave a discussion of the inherently risky nature of R&D investments to others. Jim Lux From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of atchley tds.net Sent: Thursday, August 28, 2014 6:46 AM To: Gavin W. Burris Cc: Beowulf Mailing List Subject: Re: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing To DOE's credit, they are not picking and choosing. They are funding R&D such as Fast Forward and Design Forward as well as software development. They _do_ want to see the investments pay off and be used. Scott -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From samuel at unimelb.edu.au Thu Aug 28 18:41:30 2014 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 29 Aug 2014 11:41:30 +1000 Subject: [Beowulf] Open source and the Draft Report of the Task Force on High Performance Computing In-Reply-To: <53FF256D.5050208@pathscale.com> References: <201408160846.24726.j.sassmannshausen@ucl.ac.uk> <53F1EF51.2050206@diamond.ac.uk> <201408192308.18212.j.sassmannshausen@ucl.ac.uk> <53FE93EC.3080000@cse.ucdavis.edu> <20140828122644.GC3780@shadow> <53FF256D.5050208@pathscale.com> Message-ID: <53FFDA4A.30805@unimelb.edu.au> Hiya, On 28/08/14 22:49, "C. Bergström" wrote: > The linux kernel is open source, but what about the highly modified > compute node OS which are common? I doubt there's a single customer who > has requested the source and published those modifications.. The only one I'm aware of that's around these days with much presence is the BlueGene/Q Compute Node Kernel (CNK) and that's released under the Eclipse Public License with the tarballs downloadable from here: https://repo.anl-external.org/viewvc/bgq-driver/ Not tried to build it (yet), though I have been tempted to patch it. :-) cheers, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From mdidomenico4 at gmail.com Fri Aug 29 05:31:41 2014 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 29 Aug 2014 08:31:41 -0400 Subject: [Beowulf] mpi slow pairs Message-ID: does anyone have any mpi code that checks for slow pairs between mpi ranks. and no i don't mean a micro benchmark that i can run a bunch of times and tabulate the results. i mean a program i can submit with a bunch of ranks and have it test the throughput between each and spit out a result for all the pairs. i need to test the pairwise throughput across a large machine. i'm seeing a strange infiniband issue where some pairs are full speed but others are not From mdidomenico4 at gmail.com Fri Aug 29 08:30:09 2014 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 29 Aug 2014 11:30:09 -0400 Subject: [Beowulf] mpi slow pairs In-Reply-To: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk> References: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk> Message-ID: On Fri, Aug 29, 2014 at 9:32 AM, John Hearns wrote: > I would say the usual tool for that pair-wise comparison is Intel IMB > https://software.intel.com/en-us/articles/intel-mpi-benchmarks > I hope I have got your requirement correct! John, Close, but not exact. IMB will test ranks, but will not tell me if a specific pair of ranks is slower than others, only the collective of the ranks under test. what i'm looking for is an mpi version of this:

for x in node1->node100
  for y in node1->node100
    if x==y then skip
    else mpirun -n 2 -npernode 1 -host $x,$y bwtest > $x$y.log

unfortunately, the mpirun task takes about 3 secs per iteration, and with 10k iterations, it's going to take a long time and i'm being impatient. i've been trying to write the mpi code myself, but my mpi is a little rusty so it's slow going... > Also have you run ibdiagnet to see if anything is flagged up? i've run a multitude of ib diags on the machines, but nothing is popping out as wrong. what's weird is that it's only certain pairing of machines not any one machine in general.
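
A minimal sketch of the kind of single-job, all-pairs test being asked for here, for anyone who wants a starting point. It is not the QLogic/Intel or OSU program mentioned later in the thread, just generic MPI in C; the message size, repetition count and the pairbw/bwtest-style names are arbitrary choices. Launched with one rank per node (e.g. mpicc pairbw.c -o pairbw; mpirun -n 100 -npernode 1 ./pairbw, mirroring the mpirun flags above), it walks every rank pair one at a time, times a one-way transfer, and prints a bandwidth line per pair so slow links stand out.

/* pairbw.c - time a point-to-point transfer between every pair of
 * ranks, one pair at a time, and print one bandwidth figure per pair. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (8*1024*1024)   /* 8 MB per transfer (arbitrary) */
#define REPS      10              /* timed transfers per pair */

int main(int argc, char **argv)
{
    int rank, size, i, j, r, namelen;
    char name[MPI_MAX_PROCESSOR_NAME], peer[MPI_MAX_PROCESSOR_NAME];
    char *buf;
    double t0, t1, mbps;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    buf = malloc(MSG_BYTES);
    if (!buf)
        MPI_Abort(MPI_COMM_WORLD, 1);
    memset(buf, 1, MSG_BYTES);

    /* test each unordered pair once, with everyone else idle */
    for (i = 0; i < size; i++) {
        for (j = i + 1; j < size; j++) {
            MPI_Barrier(MPI_COMM_WORLD);   /* keep other pairs quiet */
            if (rank == i) {
                /* one warm-up send, then the timed transfers */
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, j, 0, MPI_COMM_WORLD);
                t0 = MPI_Wtime();
                for (r = 0; r < REPS; r++)
                    MPI_Send(buf, MSG_BYTES, MPI_CHAR, j, 1, MPI_COMM_WORLD);
                /* receiving the peer's hostname doubles as the ack that
                 * tells us all the data has arrived */
                MPI_Recv(peer, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, j, 2,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                t1 = MPI_Wtime();
                mbps = (double)MSG_BYTES * REPS / (t1 - t0) / 1.0e6;
                printf("%s -> %s : %.1f MB/s\n", name, peer, mbps);
                fflush(stdout);
            } else if (rank == j) {
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                for (r = 0; r < REPS; r++)
                    MPI_Recv(buf, MSG_BYTES, MPI_CHAR, i, 1, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, i, 2,
                         MPI_COMM_WORLD);
            }
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Because only one pair is active at a time, a full sweep is O(N^2) transfers - a few tens of minutes for ~100 nodes at the sizes above - but it avoids the per-pair mpirun start-up cost of the shell loop. If that is still too slow, the loop could be rearranged into rounds of disjoint pairs that run concurrently, at the price of some congestion between pairs.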
From lindahl at pbm.com Fri Aug 29 08:49:57 2014 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 29 Aug 2014 08:49:57 -0700 Subject: [Beowulf] mpi slow pairs In-Reply-To: References: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk> Message-ID: <20140829154957.GA6879@bx9.net> On Fri, Aug 29, 2014 at 11:30:09AM -0400, Michael Di Domenico wrote: > > Also have you run ibdiagnet to see if anything is flagged up? > > i've run a multitude of ib diags on the machines, but nothing is > popping out as wrong. what's weird is that it's only certain pairing > of machines not any one machine in general. Huh, Intel (PathScale/QLogic) has shipped a NxN debugging program for more than a decade. The first vendor I recall shipping such a program was Microway. I guess it takes a while for good practices to spread throughout our community! -- greg From mdidomenico4 at gmail.com Fri Aug 29 09:09:45 2014 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 29 Aug 2014 12:09:45 -0400 Subject: [Beowulf] mpi slow pairs In-Reply-To: <88F9D072D5E6434BB9A49625A73D129269AD55@SRV-vEX2.viglen.co.uk> References: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk> <88F9D072D5E6434BB9A49625A73D129269AD55@SRV-vEX2.viglen.co.uk> Message-ID: On Fri, Aug 29, 2014 at 11:38 AM, John Hearns wrote: >> Also have you run ibdiagnet to see if anything is flagged up? > > i've run a multitude of ib diags on the machines, but nothing is popping out as wrong. what's weird is that it's only certain pairing of machines not any one machine in general. > > Would that then be a problem in one of the blades or a part of the switch? not sure yet, i think on the spine modules in the switch is silently failing to send traffic a full speed, but i've not been able to "prove" this yet. From lindahl at pbm.com Fri Aug 29 10:54:41 2014 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 29 Aug 2014 10:54:41 -0700 Subject: [Beowulf] mpi slow pairs In-Reply-To: <20140829154957.GA6879@bx9.net> References: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk> <20140829154957.GA6879@bx9.net> Message-ID: <20140829175441.GA15227@bx9.net> On Fri, Aug 29, 2014 at 08:49:57AM -0700, Greg Lindahl wrote: > Huh, Intel (PathScale/QLogic) has shipped a NxN debugging program for > more than a decade. The first vendor I recall shipping such a program > was Microway. I guess it takes a while for good practices to spread > throughout our community! And in the credit-where-credit-is-due department, looks like PathScale originally got the code from Dr. Panda's group at OSU. -- greg From mdidomenico4 at gmail.com Fri Aug 29 13:20:18 2014 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 29 Aug 2014 16:20:18 -0400 Subject: [Beowulf] mpi slow pairs In-Reply-To: <201408291927.s7TJR5Bv026194@nmsh1.nsc.no> References: <201408291927.s7TJR5Bv026194@nmsh1.nsc.no> Message-ID: On Fri, Aug 29, 2014 at 3:26 PM, H?kon Bugge wrote: > Hmm, are all pairs going through the spine? If not, look up the parking-lot > problem. H?kon i believe all the pairs do pass through a spine. i'm not familiar with the "parking-lot problem", i'll google it, but suspect a bazillion hits will come back > Sendt fra min HTC > > ----- Reply message ----- > Fra: "Michael Di Domenico" > Til: "Beowulf Mailing List" > Emne: [Beowulf] mpi slow pairs > Dato: fre., aug. 29, 2014 18:09 > > > > On Fri, Aug 29, 2014 at 11:38 AM, John Hearns > wrote: >>> Also have you run ibdiagnet to see if anything is flagged up? 
>> >> i've run a multitude of ib diags on the machines, but nothing is popping >> out as wrong. what's weird is that it's only certain pairing of machines >> not any one machine in general. >> >> Would that then be a problem in one of the blades or a part of the switch? > > not sure yet, i think on the spine modules in the switch is silently > failing to send traffic a full speed, but i've not been able to > "prove" this yet. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From hakon.bugge at gmail.com Sun Aug 31 00:11:23 2014 From: hakon.bugge at gmail.com (=?iso-8859-1?Q?H=E5kon_Bugge?=) Date: Sun, 31 Aug 2014 09:11:23 +0200 Subject: [Beowulf] mpi slow pairs In-Reply-To: References: <201408291927.s7TJR5Bv026194@nmsh1.nsc.no> Message-ID: <60152910-B5C5-4F04-BB74-75BF27D26A74@gmail.com> On 29. aug. 2014, at 22.20, Michael Di Domenico wrote: > i believe all the pairs do pass through a spine. IB is destination routed. Without knowing your topology and/or traffic pattern, I expect the pairs connected to the same leaf not to go though the spine. And given that is the case, you are vulnerable to the parking-lot problem > i'm not familiar > with the "parking-lot problem", i'll google it, but suspect a > bazillion hits will come back You will. But if there are no erroneous components in your system, the behavior of your system is a function of the system itself and the workload given to it. The parking-lot problem might be an explanation of less than perfect bnehaviour. Not saying it is though. H?kon > > >> Sendt fra min HTC >> >> ----- Reply message ----- >> Fra: "Michael Di Domenico" >> Til: "Beowulf Mailing List" >> Emne: [Beowulf] mpi slow pairs >> Dato: fre., aug. 29, 2014 18:09 >> >> >> >> On Fri, Aug 29, 2014 at 11:38 AM, John Hearns >> wrote: >>>> Also have you run ibdiagnet to see if anything is flagged up? >>> >>> i've run a multitude of ib diags on the machines, but nothing is popping >>> out as wrong. what's weird is that it's only certain pairing of machines >>> not any one machine in general. >>> >>> Would that then be a problem in one of the blades or a part of the switch? >> >> not sure yet, i think on the spine modules in the switch is silently >> failing to send traffic a full speed, but i've not been able to >> "prove" this yet. >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From stewart at serissa.com Sun Aug 31 12:45:22 2014 From: stewart at serissa.com (Lawrence Stewart) Date: Sun, 31 Aug 2014 15:45:22 -0400 Subject: [Beowulf] mpi slow pairs In-Reply-To: References: <201408291927.s7TJR5Bv026194@nmsh1.nsc.no> Message-ID: I believe in this context, the parking lot problem refers to the problem of cars leaving a parking lot via one exit, with a tree of merge points before the exit. If each merge is "fair" then a particular flow sees a bandwidth of 1/(2**n) where n is the number of merge points to the exit. 
(Try getting off the roof deck of the Boston Science Museum parking garage at closing time!) This is talked about in Dally and Towles' Principles of Interconnection Networks, my copy of which is at the office... This is an effect that only happens when there are a lot of flows, and there is congestion in the network somewhere. In IB, if I understand IB correctly, which is unlikely, congestion happens if you have, for example, a fat-tree which is not non-blocking (1/2 or 1/4 non-blocking) or if the sum of the flows exceeds that node's input link. In these circumstances, flows which have fewer hops will get more bandwidth than flows which have more. In addition, flows which happen to use links congested by these slow flows will also become slow, for example due to head-of-line blocking or full switch buffers. None of these effects should be visible in a pairwise bandwidth test, which would only have one flow at a time in the network. Instead, the pairwise test ought to reveal slow pairs that cross, say, links with high error rates or bad switch ports. Such testing might give confusing results if the IB network is set up for dynamic routing, which might change flows to avoid slow links (not sure how this works in IB, but maybe it could be turned off.) Getting back to the original question, I'm not aware of such an MPI test, but if one isn't lying around in the Ohio State corpus or via Intel/Pathscale..., it shouldn't be hard to write one. I wasn't able to find a good Internet-accessible reference for the parking-lot problem, but it is mentioned in First Experiences with Congestion Control in InfiniBand Hardware (Gran et al., 2010). The citation there is to Dally. -L On 2014, Aug 29, at 4:20 PM, Michael Di Domenico wrote: > On Fri, Aug 29, 2014 at 3:26 PM, Håkon Bugge wrote: >> Hmm, are all pairs going through the spine? If not, look up the parking-lot >> problem. Håkon > > i believe all the pairs do pass through a spine. i'm not familiar > with the "parking-lot problem", i'll google it, but suspect a > bazillion hits will come back > > >> Sendt fra min HTC >> >> ----- Reply message ----- >> Fra: "Michael Di Domenico" >> Til: "Beowulf Mailing List" >> Emne: [Beowulf] mpi slow pairs >> Dato: fre., aug. 29, 2014 18:09 >> >> >> >> On Fri, Aug 29, 2014 at 11:38 AM, John Hearns >> wrote: >>>> Also have you run ibdiagnet to see if anything is flagged up? >>> >>> i've run a multitude of ib diags on the machines, but nothing is popping >>> out as wrong. what's weird is that it's only certain pairing of machines >>> not any one machine in general. >>> >>> Would that then be a problem in one of the blades or a part of the switch? >> >> not sure yet, i think on the spine modules in the switch is silently >> failing to send traffic a full speed, but i've not been able to >> "prove" this yet.
>> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From samuel at unimelb.edu.au Sun Aug 31 16:31:14 2014 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 01 Sep 2014 09:31:14 +1000 Subject: [Beowulf] mpi slow pairs In-Reply-To: <20140829154957.GA6879@bx9.net> References: <88F9D072D5E6434BB9A49625A73D129269AB87@SRV-vEX2.viglen.co.uk> <20140829154957.GA6879@bx9.net> Message-ID: <5403B042.4050100@unimelb.edu.au> On 30/08/14 01:49, Greg Lindahl wrote: > Huh, Intel (PathScale/QLogic) has shipped a NxN debugging program for > more than a decade. The first vendor I recall shipping such a program > was Microway. I guess it takes a while for good practices to spread > throughout our community! "The first rule of Infiniband debugging is nobody talks about Infiniband debugging". Got a link for it please? cheers, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci