[Beowulf] What services do you run on your cluster nodes?
Donald Becker becker at scyld.com
Tue Sep 23 10:03:29 PDT 2008
- Previous message: [Beowulf] What services do you run on your cluster nodes?
- Next message: [Beowulf] What services do you run on your cluster nodes?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Mon, 22 Sep 2008, Perry E. Metzger wrote:

>
> Prentice Bisbal <prentice at ias.edu> writes:
> > The more services you run on your cluster node (gmond, sendmail, etc.)
> > the less performance is available for number crunching, but at the same
> > time, administration difficulty increases. For example, if you turn off
> > postfix/sendmail, you'll no longer get automated e-mails from your
> > system to alert you to a problem.
>
> If a machine isn't sending out more than, say, 20,000 email
> messages an hour, you won't notice the additional load Postfix puts on
> a modern machine with any reasonable measurement tool.
>
> FYI, a modern box running postfix can handle millions of messages per
> hour before it starts getting into trouble.

The overall load isn't the issue, it's the scheduling interference.

If you have a dozen nodes working on a fine-grained, lock-step
computation, nodes taking a millisecond off every second isn't noticed.
If you have a few hundred nodes working on the problem, that millisecond
is a huge problem.

We recognized this effect over a decade ago. It was a motivation when
we designed the Scyld cluster system in early 2000, and was a key point
when we started talking about it back then. The effect has been
independently discovered many times, but I think that we have one of the
cleanest approaches.

We solved the problem by using a full-featured, fully-installed head
("master") node that ran all standard services, and having the rest of
the nodes be start-from-zero compute slaves that don't run anything but
the application.

This is much different from the "what can I eliminate" mindset. Designs
that start from a full install and strip it down often eliminate too
much, or don't understand that unused "idle" things aren't really free.
Idle daemons frequently wake up, look around, and go back to sleep.

Look at the research that has gone into making the Linux kernel "tick
free". The focus has been on power savings rather than HPC, but their
findings provide third-party confirmation. They eliminated periodic
timer ticks, instead using a countdown timer to wake the kernel only
when needed. Except that so many things wake up, look around, and go
back to sleep that they didn't see much savings!

The secondary effects are the real cost, and they are difficult to
measure directly. Every time a daemon wakes, it kills application ...
uhmm "momentum". It flushes a bunch of cache lines and PTE lookaside
entries. It might kick out a few pages and D-cache entries. These
might break up application I/O that could otherwise be coalesced into a
big request.

How much time does all this cost? Well, much of the time not very much.
But occasionally the coincidences stack up and become really expensive.
Like a single driver stopping during rush-hour traffic, the whole
cluster-wide app stops.

Next posting: how the app itself can be the cause of slow-downs, and why
cluster-specific nameservices and library/executable memory "wire-downs"
solve problems.

-- 
Donald Becker                           becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com                www.scyld.com
Annapolis MD and San Francisco CA
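
Editorial illustration (not part of the original message): the scaling
argument above, a millisecond lost per second going unnoticed on a dozen
nodes but hurting on a few hundred, can be roughed out with a toy
probability model. Everything in this sketch is an assumption made for
illustration: each node pauses once per period at an independent random
time, the application synchronizes at the end of every compute step, and
a step is stretched by one pause length if any node pauses during it.
The function names and the 1 ms step size are hypothetical, and the
model ignores the cache and I/O side effects the message describes.

```python
def hit_probability(nodes, step_s, period_s=1.0):
    """Chance that at least one node pauses during a single compute step.

    Assumes each node pauses once per period at an independent, uniformly
    random time, so one node lands in a given step with probability
    step_s / period_s (capped at 1).
    """
    p_node = min(1.0, step_s / period_s)
    return 1.0 - (1.0 - p_node) ** nodes


def expected_slowdown(nodes, step_s, pause_s=1e-3, period_s=1.0):
    """Expected step time relative to an interference-free step.

    Assumes a lock-step application: every node waits at a barrier, so a
    single pause of pause_s on any node delays the whole step by pause_s.
    """
    return 1.0 + hit_probability(nodes, step_s, period_s) * pause_s / step_s


if __name__ == "__main__":
    # 1 ms compute steps, 1 ms pause per second on every node (assumed values).
    for n in (12, 100, 500):
        print(f"{n:4d} nodes: slowdown factor {expected_slowdown(n, 1e-3):.3f}")
```

Under these assumptions the per-node cost is fixed at 0.1%, but the
chance that some node stalls a given step grows quickly with node count,
so the whole-application slowdown goes from roughly a percent on a dozen
nodes to tens of percent on several hundred.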
