[Beowulf] Xgrid and Mosix (fwd from john@rudd.cc)
Eugen Leitl eugen at leitl.org
Thu Dec 30 04:20:59 PST 2004
- Previous message: [Beowulf] OpenMP/PBS tuning
- Next message: [Beowulf] mpirun -nolocal option seems no to be working
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
----- Forwarded message from John Rudd <john at rudd.cc> -----

From: John Rudd <john at rudd.cc>
Date: Wed, 29 Dec 2004 19:37:02 -0800
To: xgrid-users at lists.apple.com
Subject: Xgrid and Mosix
X-Mailer: Apple Mail (2.619)

I see in the archives that someone asked about OpenMosix back in September ( http://lists.apple.com/archives/xgrid-users/2004/Sep/msg00023.html ), but I didn't see any responses. So I thought I'd ask too, but with a little more detail.

The thing that I find interesting about the Mosix style distributed computing environment is that applications do NOT need to be re-written around it. Mosix abstracts the distributed computing cluster away from the program and developer in the same way that threads abstract multi-processing away from the program and developer. Under Mosix, any program, without having to be written around any special library, without having to be relinked or recompiled, can be moved off to another processing node if there are nodes that are significantly less busy than yours. And, AFAIK, any multi-threaded application can make use of multiple nodes (with threads being spawned on any host that is less loaded than the current node).

Imagine taking a completely mundane but multi-threaded application (I'll assume Photoshop is multi-threaded and use that as an example). Suddenly, without having to get Adobe to support Xgrid, you can use Xgrid to speed up your Photoshop rendering.

It seems to me that a similar set of features could be added to Xgrid. The threading and process spawning code within the kernel could be extended by Xgrid to check for lightly loaded Agents, and move the new process or thread to that Agent. Only the IO routines would need to exist on the Client (and even then, maybe not: if every node has a similar filesystem image, then only the UI (for user bound applications) or primary network interface code (for network daemons/servers) needs to run on the original Client system).
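The "move work to a significantly less busy node" rule described above can be sketched in a few lines. This is only an illustration of the idea, not how Mosix or Xgrid actually decide: the function name `pick_migration_target`, the load figures, and the `min_gap` threshold are all assumptions made up for the example.

```python
from typing import Dict, Optional


def pick_migration_target(local_node: str,
                          loads: Dict[str, float],
                          min_gap: float = 0.5) -> Optional[str]:
    """Return the least-loaded remote node, or None to stay local.

    A remote node only qualifies if its load is lower than the local
    load by at least min_gap (a hypothetical stand-in for
    "significantly less busy").
    """
    local_load = loads[local_node]
    # Consider every node except the one the work is currently on.
    remote = {name: load for name, load in loads.items() if name != local_node}
    if not remote:
        return None
    best = min(remote, key=remote.get)
    if local_load - remote[best] >= min_gap:
        return best
    return None


# Hypothetical load averages for one Client and two Agents.
loads = {"client": 2.0, "agent1": 0.3, "agent2": 1.8}
target = pick_migration_target("client", loads)
```

With these made-up numbers the new process or thread would go to `agent1`; if no Agent is at least `min_gap` less loaded, the work stays on the Client.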
From what I recall, the Mach microkernel already makes some infrastructure for this type of thing available; it just needs to be utilized, and done deep enough in the kernel that an application doesn't need to know about it.

Though, that does bring up one consideration: I have a friend who did a lot of distributed computing work when he was working for Flying Crocodile (a web hosting company that specialized in porn sites, where his distributed computing code had to support multiple-millions of hits per second). His experience there gave him a concern about Mosix style distributed computing. One of the advantages of something like Beowulf is that the coder often needs to control what things need to be kept low latency (must use threads for SMP on the local processor) and what things can have high latency (can use parallel code on the network), and the programming interface type of distributed computing gives them that flexibility.

The idea that I suggested was something like nice/renice in Unix, where you could specify certain parallelism parameters to a process before you run it, or after it is already running. For example, instead of "process priority", you might specify a sort of "process affinity" or "thread affinity". For process affinity, a low number (which means high affinity, just like priority and nice numbers) means "when this process creates a child, it must be kept close to the same CPU as the one that spawned it". Thread affinity would be the same, but for threads. A default of zero means "everything must run locally". A high number means "I can tolerate more latency" (so, "latency tolerance" would be the opposite of "affinity").

(It occurs to me after I wrote all of this that it might be easier for the end user to think in terms of "latency tolerance" instead of "process affinity", high latency = high number, instead of the opportunity for confusion that affinity has since the numbers go in the opposite direction ... I hope all of that made sense.)

A process with a low process affinity (high number) and a high thread affinity (low number) can spawn new tasks/processes/applications anywhere in the network, but any threads for it (or its sub-processes) must exist on the same node as its main thread. Or, if you want all of the applications to be running on your workstation/Client, but run their threads all over the network, then you set a high process affinity (low number), and a low thread affinity (high number).

I would have the xgrid command line tool have such a facility (I don't know if it does already or not, I haven't really done much with xgrid) similar to both the "nice" and "renice" commands. I would also add a preference pane that allows the user to set a default process affinity, a default thread affinity, and a list of applications and default affinities for each of those applications (so that they can be exceptions to the default, without the user having to set it via command line every time). Last, I would add a tool, possibly attached to the Xgrid tachometer, which would allow me to adjust an affinity after a program was running.

The only thing up in the air is the ability to move a running thread from one node to another while it's running (well, during a context switch, really). I know a friend of mine at Ga Tech was doing PhD research on that (portable threads) 10ish years ago, but I don't know if it got anywhere. But, that would allow someone to lower the number of an application's affinity while it's running, thus recalling the threads or processes from a remote Agent to the local Client (the scenario being I have a laptop that is an Xgrid Client, and I start running applications that spread out across the network ... then I get up to leave, so I lower the affinity numbers of everything so that the tasks and threads come back to my laptop, running slower now that they have fewer nodes to run upon, but still running (or sleeping, as the case might be)).

So ... all of that leads up to: does anyone know if Xgrid is working on this type of Application-Transparent Distributed Computing that Mosix, OpenMosix, and I think OpenSSI have? I think it would be a natural extension to Xgrid: Apple is trying to make this as "it just works" as possible, so it seems that it should not only be easy for the sysadmin to set up the distributed computing cluster, but easy/transparent for the developer, too (in the same way that threads made Multi-Processing easier and more abstract for the developer, this type of distributed computing makes threads not just a multi-processing model, but a distributed computing model).

Ultimately, it even makes distributed computing easy for the user: they don't need to learn how to re-code a program (or coerce a vendor into making a distributed version of their application), any multi-threaded application will use multiple nodes, and even single-threaded non-distributed applications can be run on remote nodes. That seems like a powerful "it just works" capability to me.

(The main drawback of Mosix, OpenMosix, and OpenSSI from my perspective is that they're Linux only, specifically developed for the Linux kernel ... but I'd really love to see something like them available for Mac OS X.)

Thoughts?
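The affinity / "latency tolerance" numbers proposed in the message could be modelled roughly as follows. This is a sketch under stated assumptions, not an existing Xgrid or Mosix interface: the names `Tolerance`, `place`, and `to_recall` are hypothetical, and reading the number as "maximum network distance in hops from the originating node" is one possible interpretation that the message itself does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Tolerance:
    # Higher number = more latency tolerated (lower affinity).
    # 0 is the proposed default: everything must run locally.
    process: int = 0
    thread: int = 0


def place(kind: str, tol: Tolerance,
          hops: Dict[str, int], loads: Dict[str, float]) -> str:
    """Pick the least-loaded node within the tolerance for this kind.

    hops maps node name -> distance from the originating node (assumed
    metric); the originating node itself must be present with distance 0.
    """
    limit = tol.process if kind == "process" else tol.thread
    allowed = [node for node, dist in hops.items() if dist <= limit]
    return min(allowed, key=lambda node: loads[node])


def to_recall(placements: Dict[str, str],
              hops: Dict[str, int], new_limit: int) -> List[str]:
    """List tasks whose current node is farther away than new_limit."""
    return [task for task, node in placements.items()
            if hops[node] > new_limit]


# Hypothetical network: a laptop Client and two Agents.
hops = {"laptop": 0, "agent1": 1, "agent2": 2}
loads = {"laptop": 2.0, "agent1": 0.5, "agent2": 0.1}

# Processes may go anywhere, threads must stay with the main thread.
tol = Tolerance(process=2, thread=0)
child_node = place("process", tol, hops, loads)
thread_node = place("thread", tol, hops, loads)

# Before leaving with the laptop: lower the number to 0 and see what
# would have to migrate back.
recalled = to_recall({"t1": "agent1", "t2": "laptop"}, hops, 0)
```

Here a new child process would land on `agent2` and a new thread on `laptop`, and lowering the number to 0 flags `t1` for recall. Actually moving a running thread back is the open problem the message mentions and is not modelled here.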
----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
http://moleculardevices.org http://nanomachines.net
