From scott at trinitygames.com  Tue Feb  8 13:36:11 2005
From: scott at trinitygames.com (Scott Taylor)
Date: Tue Nov  9 01:14:28 2010
Subject: [scyld-users] SCYLD/MOSIX for Game Server SSI/Process Migration?
Message-ID: <0d8701c50e26$36ef3680$4a36fea9@tradmin>

Hi,

I'm new to the list; this is my first post. Thank you for allowing me to post.

We've begun the long and painful process of exploring various cluster solutions. We run online game servers, so our applications are:

- Many separate serial tasks
- Varying load per task
- Tasks/jobs/applications run all the time

We know we won't gain many of the benefits people generally expect from parallel systems. That is OK for us. Our primary objective is to increase efficient use of server space by load balancing on the fly, i.e. SSI with process migration.

Our game server daemons, like all others, start with a burst of CPU to set up, are then generally idle until players join, and then CPU use increases with player load. This is where we hope process migration will help us make better use of our server space. We don't know in advance which of the hundreds of game servers we operate will actually have players, or when, so a single system image with process migration among many nodes appears ideal for us. With separate systems and manual load balancing, we end up with many idle systems and some that are indeed overloaded.

So far, the cluster solutions we've studied are:

- MOSIX
- SCYLD

If SCYLD can do what we want, and is affordable, it certainly seems like the obvious choice due to its apparent ease of installation and configuration. However, a major concern is that a migration event will cause a "lag spike" on the game server daemon being migrated, or on other game processes on the system -- this is a real show stopper for game servers, and our users would not tolerate it. Our processes can be compared to near real-time applications like streaming video or audio, and any hiccup is very noticeable.

In a paper written in Nov. 2002, Carlo Daffara raises this issue and overcomes the problem by using iproute2 queue controls. Here is an excerpt from the paper:
http://www.democritos.it/events/openMosix/papers/Openmosix4n.pdf

"Another problem appeared during testing: since the game server memory footprint is large (around 80 Mbytes each), we discovered that the migration of processes slowed down the remaining network activity, introducing significant packet latency (especially perceptible, since packets are very small). So, we used the linux iproute2 queue controls to establish a stochastic fair queuing discipline to the ethernet channels used for internode communications; this works by creating a set of network "bins" that host the individual network flows, marked using hashes generated from the originating and destination IP addresses and the other part of the traffic header. The individual bins are then emptied in round robin, thus prioritizing small packets over large transfer and not penalizing large transfers (like process migration)."

So, the questions raised so far in our quest are:

- Does Scyld support process migration and load balancing like MOSIX?
- Will the process migration event cause a hiccup as described by Daffara?
- Does our GigE network help to overcome this problem?
- Is it necessary (or even possible) to use the iproute2 queue controls on SCYLD?

I certainly would appreciate anyone's input on any of these or other related issues.
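For reference, the stochastic fair queueing discipline Daffara describes is normally attached with the tc tool from iproute2. A minimal sketch, assuming the internode traffic goes out eth1 (the interface name and the perturb interval are illustrative, not taken from the paper):

    # Attach a stochastic fair queueing (SFQ) qdisc to the internode interface
    # so many small game packets are not stuck behind one large migration transfer.
    tc qdisc add dev eth1 root sfq perturb 10

    # Inspect the queueing discipline now in place on that interface.
    tc qdisc show dev eth1

Whether this is needed, or even possible, on a Scyld system is exactly the question above.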
This is our available test hardware:

Head/Master: Twin Xeon 2.8, 2G RAM, 80G SATA primary for root/boot, some big RAID for the 'common' filesystem (TBD).
Nodes: P4 3.0/800, 1G RAM, diskless, PXE/Gigabit NIC.
Network: Dedicated gigabit switch, GigE/PXE to every node.

We haven't installed any OS yet. I'm still trying to find out how to obtain SCYLD. We are waiting for a reply to an email sent to the address listed on the site for finding vendors.

Thank you,
---
Scott Taylor
Network Administrator
Trinity Gaming

From billk at metrumrg.com  Wed Feb  9 08:51:40 2005
From: billk at metrumrg.com (Bill Knebel)
Date: Tue Nov  9 01:14:28 2010
Subject: [scyld-users] running non-mpi programs on scyld cluster
Message-ID: <420A3F9C.9010300@metrumrg.com>

I have a program that is called from a Perl batch script. The program is not MPI-aware, so I have been using mpprun to execute the Perl program. The Perl program can start from 1 to x processes depending upon the arguments to the batch file. I currently call the batch file as:

    mpprun -no-local perl batch.p 1 2 3 &

The arguments 1 2 3 cause the Perl program to start processes 1, 2 and 3 in three different directories. (The different directories are necessary because of the nature of the program being run.) The result is three processes all running on one node. (Each node has two processors and there are 3 nodes for now, for a total of 6 processors.)

I have tried supplying the -np x option, but this simply starts the same three processes again on another node once the initial three processes are complete. The same thing occurs if I use the -map x:x:x option. I have also tried batching the command via the "batch now" interactive command-line interface, and the result is the same.

Is there any way to tell the cluster to load balance these processes across the nodes? Or do I need to start each process with a separate mpprun command?

Also, it appears that the NO_LOCAL=1 option does not work with the "batch" command. Does that seem correct?

The cluster consists of a dual-processor (2 Xeons) master node with three compute nodes, each with 2 Xeon processors. Eventually we will have a number of additional nodes up, but I am testing for now.

Any help would be greatly appreciated.

Regards,
Bill

-- 
Bill Knebel, PharmD, Ph.D.
Principal Scientist
Metrum Research Group
15 Ensign Drive
Avon, CT 06001
email: billk@metrumrg.com
tel: (860) 930-1370
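One way independent serial jobs like these are often spread out is to place each run explicitly rather than asking a single mpprun invocation to do it. A rough sketch, assuming bproc's bpsh is available and compute nodes 0-2 are up; the directory names, node numbers and arguments are illustrative only, and whether each working directory is visible on a compute node depends on how the shared filesystem is mounted:

    # One remote-execution call per job, each pinned to its own compute node.
    # "bpsh NODE COMMAND" runs COMMAND on compute node NODE.
    ( cd /shared/run1 && bpsh 0 perl batch.p 1 ) &
    ( cd /shared/run2 && bpsh 1 perl batch.p 2 ) &
    ( cd /shared/run3 && bpsh 2 perl batch.p 3 ) &
    wait    # block until all three background jobs finish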
From jkrauska at cisco.com  Thu Feb 24 02:08:05 2005
From: jkrauska at cisco.com (Joel Krauska)
Date: Tue Nov  9 01:14:28 2010
Subject: [scyld-users] IPC Network vrs Management Network
Message-ID: <421DA785.4020208@cisco.com>

I am exploring using my cluster with different IPC networks, i.e. using MPI RNICs or InfiniBand in my cluster.

Some design questions:

1. Should the head node also be connected to the high-speed IPC network?
2. How do I tweak the /etc/beowulf/config file to support this?
3. Is it possible to pxeboot/dhcp on one interface, but issue bproc starts over the high-speed interface?

It seems benchmarks like hpl (Linpack) issue lots of master->slave communication in their normal operation. (This as opposed to Pallas, which seems to do a lot of slave<->slave communication.)

This seems to imply that Linpack is somewhat bound to your rsh/ssh/bproc choice for spawning MPI apps, which seems flawed to me, as it's not stressing MPI in this way. (Comments?)

The above seems to encourage using a higher-speed interconnect from the head node to issue the bproc calls (leaving the normal ethernet only for PXE and "management things" like stats).

The "interface" keyword in the config, coupled with the "pxeinterface" keyword, seems to encourage this type of setup, but I find that if "interface foo" is set, the pxeserver doesn't want to restart if the iprange command doesn't map to the IP subnet on interface foo. (This suggests that the dhcp functionality wants to bind to "foo" and not the given pxeinterface.)

Thus "interface" must be the pxeinterface. (Maybe someone's not parsing the pxeinterface command?)

Does anyone here have a successful cluster where the head node is connected to both the high-speed IPC network and the "management" network?

Thanks,

--joel

From becker at scyld.com  Thu Feb 24 15:01:23 2005
From: becker at scyld.com (Donald Becker)
Date: Tue Nov  9 01:14:28 2010
Subject: [scyld-users] IPC Network vrs Management Network
In-Reply-To: <421DA785.4020208@cisco.com>
References: <421DA785.4020208@cisco.com>
Message-ID:

On Thu, 24 Feb 2005, Joel Krauska wrote:

> I am exploring using my cluster with different IPC networks,
> i.e. using MPI RNICs or InfiniBand in my cluster.
>
> Some design questions:
>
> 1. Should the head node also be connected to the high-speed IPC network?

The Scyld software is designed to work with either configuration, but there are significant advantages to putting the master on the high-speed network. A few of them are:

- A master is the only type of node that deals with compute node additions. Some networks, such as Myrinet, require a re-mapping process when new machines are added. That's easily done by the master when it's part of the network and difficult otherwise.

  Why do other cluster systems not consider this an issue? Almost all other cluster systems assume a fixed cluster configuration, with hand modification (or, equivalently, custom scripts) needed when anything changes. You can use Scyld in that way, but it discards the advantage of incremental, on-line scalability possible with clusters.

- The master can monitor, manage and control the network. A Beomap plug-in can use the network statistics to create a better schedule.

- Some MPI programs, especially those converted from PVM, expect the rank 0 process to be able to do I/O. This expectation is reflected in the default Beomap scheduler, which puts the first process on the master. (See the beomap manual and "--no-local" to change this.)

- Our preferred MPI library implementation is a true library. A single process runs on the master until the MPI initialization call. The MPI initialization function creates the remote processes with a remote fork system call. This approach copies the initialization to the remote processes exactly. Most or all other cluster MPI implementations start all processes simultaneously with an auxiliary program, usually a script named 'mpirun' or 'mpiexec'. This means that the process count is fixed.

The single reason for not putting the master on a high-speed network is:

- Most switches have even port counts, e.g. 16, 64 or 128 ports, and many applications want to run on a power-of-two processor count. The next switch size up often costs more than twice as much. [[ This is very appealing for pricing optimization, but consider that the first failure removes this advantage. ]]

There are many other reasons, which I can (and will) go on about at length over a beer. Almost all of the reasons are summarized as "Yes, we can do things the way everyone else does, but that would be throwing out advances that I personally consider really important for usability or performance".
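To make the default-placement point above concrete: the standard map keeps the first process on the master, while the no-local option pushes every process out to compute nodes. A minimal sketch using only the mpprun options already mentioned in this thread (the program name and process count are placeholders):

    # Default mapping: the first process stays on the master,
    # so PVM-style rank-0 I/O keeps working.
    mpprun -np 4 ./my_mpi_app

    # Keep all four processes off the master instead.
    mpprun -no-local -np 4 ./my_mpi_app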
> 2. How do I tweak the /etc/beowulf/config file to support this?

You may not need to do anything, except set the PXE server interface. See the PXE parameter page in the Beosetup->Settings menu. This sets the "pxeinterface" line in /etc/beowulf/config.

There is an opportunity that most users are not aware of. The 'transportmode' keyword controls the underlying caching filesystem. The default system uses TCP/IP/Ethernet to cache libraries and executables from the master. By plugging in different "get-file" programs you can tune the system to make caching faster or more efficient. By changing the boot parameters, those get-file programs can be directed to use an alternate server rather than the primary master.

> 3. Is it possible to pxeboot/dhcp on one interface, but issue bproc
> starts over the high-speed interface?

That's exactly what the 'pxeinterface' configuration setting does. The original motivation was working with the Chiba City cluster at Argonne, which had Myrinet on 32 out of every 33 nodes (uhhhhgggg -- an example of "don't do this"). A later reinforcing motivation was motherboards that had both Fast and Gigabit Ethernet ports, but would only PXE boot off of the Fast Ethernet.

> It seems benchmarks like hpl (Linpack) issue lots of master->slave
> communication in their normal operation. (This as opposed to Pallas,
> which seems to do a lot of slave<->slave communication.)
>
> This seems to imply that Linpack is somewhat bound to your rsh/ssh/bproc
> choice for spawning MPI apps, which seems flawed to me, as it's not
> stressing MPI in this way. (Comments?)
>
> The above seems to encourage using a higher-speed interconnect
> from the head node to issue the bproc calls (leaving the normal
> ethernet only for PXE and "management things" like stats).

If job-spawning performance is a concern, there are better solutions that we have prototyped. The current method is:

    generate a map for process placement, map[0] .. map[N-1]
    the parent process remote forks to nodes map[1] .. map[N-1]
    if map[0] is not the master, the parent process moves to node map[0]

A slight variation is:

    generate a map for process placement, map[0] .. map[N-1]
    if map[0] is not the master,
        the parent process moves to node map[0]
        the parent process remote forks to nodes map[1] .. map[N-1]
    else
        the parent process remote forks to node map[1]
        that child remote forks to nodes map[2] .. map[N-1]

> The "interface" keyword in the config, coupled with the
> "pxeinterface" keyword, seems to encourage this type of setup,
> but I find that if "interface foo" is set, the pxeserver doesn't
> want to restart if the iprange command doesn't map to the IP
> subnet on interface foo. (This suggests that the dhcp functionality
> wants to bind to "foo" and not the given pxeinterface.)

That's not quite the way it works.

    interface <IFNAME>       # Sets the cluster private network
    pxeinterface <IFNAME>    # Enables true PXE

If the pxeinterface keyword is not used, the PXE server reverts to DHCP+TFTP on the cluster interface. This is slightly different than true PXE: it passes most of the PXE options to the client, but doesn't use an intermediate "port 4011" agent.

If the pxeinterface is the same as the cluster interface, the server behavior changes to true PXE protocol on the cluster private network. If it is a different interface, that interface is assumed to be up and assigned a valid IP address. That IP address is typically *not* in the cluster IP address range.
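As an illustration of the two-keyword setup described above, a fragment of /etc/beowulf/config for a cluster with a gigabit IPC network and a separate Fast Ethernet used only for booting might look roughly like this. The interface names and addresses are made up, and the exact iprange syntax should be checked against the beowulf-config documentation:

    # Cluster private network: bproc traffic and compute node addresses live here.
    interface eth1
    # Address pool handed out to compute nodes on that network (illustrative).
    iprange 10.54.50.100 10.54.50.131
    # Interface that answers PXE/DHCP boot requests from the nodes' Fast
    # Ethernet ports; its own IP address sits outside the range above.
    pxeinterface eth0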
A source of confusion here is a decision we made several years ago about which part of the system controls network interfaces. Originally our tools handled the network interface configuration, including setting the IP address information and bringing the interface up and down. This worked very well. Over time the de facto "standard" Linux tools became increasingly insistent on managing the network interface settings. Those administrative tools really, *really* want to control every network connection, especially the automatic start-up, shutdown and firewall configuration. Thus our infrastructure now has to assume that the interfaces are correctly configured for the cluster, and can only log a complaint and quit if they are not configured or are inconsistent.

> Thus "interface" must be the pxeinterface. (Maybe someone's not parsing
> the pxeinterface command?)

The only known bug in this area is that older versions of Beosetup would fail to read in an existing pxeinterface specification. You would have to set it by hand each time you started Beosetup. That bug is now fixed.

BTW, a few of the PXE-related keywords are:

    nodeassign
    pxefileserver <SERVERIP>
    pxebootfile [NODE-RANGE] <BOOTFILE>
    pxebootcfgfile [NODE-RANGE] <BOOTFILE>
    pxebootdir <DIR>
    pxeinterface <IFNAME>
    pxebandwidth <BANDWIDTH>

Syntax: pxefileserver <SERVERIP>
    Use SERVERIP as the machine that serves the boot images.

Syntax: pxebootfile [NODE-RANGE] <BOOTFILE>
Syntax: pxebootcfgfile [NODE-RANGE] <BOOTFILE>
    Use BOOTFILE as the node bootstrap program or configuration file. An unspecified NODE-RANGE means use this BOOTFILE as the default.

Syntax: pxebootdir <DIR>
    Use DIR as the root directory for the TFTP file server.

Syntax: pxeinterface <IFNAME>
    Use IFNAME as the network interface. Note that this must be a physical interface, as it watches for broadcast packets and responds with broadcast packets.

Syntax: pxefileserver <SERVERIP>
    Specify an alternate boot file server. This instructs the booting machine to retrieve all subsequent files from the machine SERVERIP. This is typically used only in very large cluster configurations, where the network load of booting machines may interfere with the master's operation.

Syntax: pxebandwidth <BANDWIDTH>
    Limit the bandwidth used by the boot subsystem to the integer value BANDWIDTH. Note that this is in bits, not bytes. This value should not be set unless there is a specific performance problem noted while groups of new nodes are booting.

Donald Becker                           becker@scyld.com
Scyld Software                          Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220           www.scyld.com
Annapolis MD 21403                      410-990-9993