[Beowulf] What services do you run on your cluster nodes?
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduTue Sep 23 04:09:03 PDT 2008
- Previous message: [Beowulf] What services do you run on your cluster nodes?
- Next message: [Beowulf] What services do you run on your cluster nodes?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Tue, 23 Sep 2008, Ashley Pittman wrote: > On Mon, 2008-09-22 at 20:54 -0400, Perry E. Metzger wrote: >> >> By the way, if you really can't afford for things to "go away" for >> 1/250th of a second very often, I have horrible news for you: NO >> COMPUTER WILL WORK FOR YOU. > > To a large extent you are actually correct, this is one of the reasons > why building "large" clusters is hard. If you just plug them together > and install $mpi the performance will suck for this very reason. > > The thing to remember is "noise" or "jitter" probably doesn't affect > many people, it's effects become noticeable as job size (not cluster > size) increases and the effects are non-linear. You can probably run a > 128 node cluster and not notice it at all, beyond this scale however and > time spent turning things off it time well spent. In fact in my > experience noise becomes significant not just with size but as as square > of size. Note also (agreeing with you, and Greg, and H on) that it is "large clusters running certain kinds of parallel code". In particular synchronous code with lots of communications barriers. The point is that if you are running a parallel computation with a barrier (a point where all computation ceases on ALL nodes until a communication process completes that itself involves all nodes) then blocking a single node while it handles a random interrupt(ion) effectively blocks the entire computation. Part of the problem is the "random" bit. Computers have to deal with a lot of asynchronous stuff as this or that wakes up and needs attention. The timer interrupt, network interrupts, disk interrupts, and yeah, BIOSen clamoring for attention whenever THEY want to be serviced. Some of these things are "elective" in the sense that the kernel has some say on when they get scheduled, and in a large parallel computation the kernel states CAN self-organize into a partially synchronous state of operation. By synchronous I mean that if one could arrange it so that all the nodes were clock synchronized to microseconds and could agree to spend the SAME fraction of every second doing ALL housekeeping, even a very large cluster would get the use of that whole second less the fraction required by housekeeping less a bit of e.g. context switch latency and other overhead. If all nodes are running a synchronous parallel task that runs from the barrier in about the same amount of time doing about the same amount of work with the same systems calls in the same order, then everything "meshes", and if they all enter a wait state (give back the CPU) at about the same time, the kernel will usually wake up and do its cumulated lower-half handler work and so on synchronously as well and life is good. But randomness -- asynchronicity -- then is the devil, as it is nonlinearly amplified as soon as it lifts out of the noise in a truly large synchronous computation. Embarrassingly parallel applications, on the other hand, don't care. Even real parallel applications that are coarse grained and aren't particularly synchronous don't care "much" and will scale way out there even with a modest amount of background asynchronous noise. So Don B., does Scyld "solve" this problem by running the node kernels synchronously, so they all use the same jiffy-window to do housekeeping (obviously it at least partially solves it by not running anything unnecessary in terms of housekeeping, but "anybody" can do that:-)? > Note that it's not just the "OS" fluff which causes problems and turning > things off doesn't get you anything like 100% of the way there, some > deamons have to run (your job scheduler) so all you can do it tune them > or re-code them to use less CPU and some kernel versions are pretty bad, > one version of Red-Hat was effectively un-usable on clusters because of > kscand. The problem actually predates clusters -- it is a very old problem with LANs in general, especially TCP/IP/ethernet LANs. Back in the old days of vampire taps and coaxial cables, all network traffic ran on either a daisy chain ethernet or through a "hub". Every packet from every machine was seen by every other machine on the LAN segment. Although the networks were typically segmented at the ethernet level, they were often not segmented at the the TCP/IP level as routers were expensive and difficult to manage. The network interface of one's machine was therefore effectively on a large flat TCP/IP network one that could literally span several cities or states. In a few cases (e.g. decnet) it could span the nation!. This meant that there could be hundreds or even thousands of machines that saw every packet produced by every other machine on the LAN, possibly after a few ethernet bridge hops. This made conditions ripe for what used to be called a "packet storm" (a term that has been subverted as the name and trademark of a company, I see, but alas there is no wikipedia article on same and even googled definitions seem scarce, so it is apparently on its way to being a forgotten concept). A packet storm occurred because the kernel has to examine every packet that comes in to its network interface to decide whether or not it is for it. Back in those days, computers were slow and a significant fraction of the total CPU power of your computer could easily be spent handling packets intended for the other systems on your LAN if the LAN was heavily loaded. This created a positive feedback loop in delay -- if one passed a critical threshold in network load, systems would try to send out packets, find the line busy, and enter the ethernet "random delay" phase. In the meantime, the kernel would have new packets appearing that it had to deal with -- the line was busy. In the meantime, more packets would be backing up. Very quickly the shared line -- which was designed to be efficient for BURSTY short messages, not saturation -- would wedge, just like a freeway in rush hour, and messages would start taking seconds instead of milliseconds to deliver. The CPU load managing the collisions, retransmissions, and examinations of saturation-load packet traffic would skyrocket. System performance would plummet even on systems not really using the network at the time. NFS delays, etc, would cause user processes to hang. Basically an entire LAN could effectively "crash", although if there weren't timing problems adding to the collision problem (there usually were, because nobody liked the restriction on the radius of the wired LAN and daisy chain networks didn't get retimed except at expensive bridges) in principle it COULD sort itself all out eventually. The moral of the story being that everything consisting of many parts that must move together has scaling limits of sorts, and when it leaves the regime of linear slow-growth scaling and enters the region where nonlinear amplification of random noise occurs, bad things will surely happen. Truthfully, just suppressing sources of randomness won't do it when one enters the REALLY nonlinear region as ANY fluctionation can precipitate a catastrophe (which can be viewed as a dynamic phase transition from an ordered to a disordered state). But even before one gets there, there is a region where fluctuations are amplified with a gradually diverging lifetime -- they don't kick one into a completely dysfunctional state but they can feed back and disorder the extended system far longer than their natural lifetime. Interesting stuff, actually. rgb > > Ashley Pittman. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977
- Previous message: [Beowulf] What services do you run on your cluster nodes?
- Next message: [Beowulf] What services do you run on your cluster nodes?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
