[Beowulf] distributions

Mon Feb 6 21:08:02 PST 2006

Donald Becker wrote:

>On Fri, 3 Feb 2006, Geoff Jacobs wrote:
>
>>Robert G. Brown wrote:
>>
>>>Note well that for most folks, once you get a cluster node installed and
>>>working, there is little incentive to EVER upgrade the OS except for
>>>administrative convenience. We still have (I'm ashamed to say) nodes
>>>running RH 7.3. Why not? They are protected behind a firewall, they
>>>are stable (obviously:-), they work.
>>>
>>Hmm... crispy on the outside, tender on the inside.
>>
>>You have to have an OS installed on your nodes for which security
>>updates are available
>
>I'll disagree with that point.
>The Scyld Beowulf system introduced a unique design that has "slave 
>nodes" with nothing installed and no network file system required.
>This avoids most security problems, including the need for updates,
>because the vulnerable services and programs just aren't there.
>
>Only master nodes have an OS distribution.  The other nodes are 
>server slaves or compute slaves.  At boot time they are given a kernel by a 
>master, queried by the master for their installed hardware, and then given
>device drivers to load.  When it's time to start a server or compute 
>process, they are told which versions of the libraries and executables for 
>that process they should cache or reuse from a hidden area.
>
>The hidden area is usually a ramdisk file system.  The objects there are 
>only libraries and executables, tracked by version information, so 
>that they may be reused and shared.
>
>One way to think of the result is a dynamically generated minimal 
>distribution that has only the specific kernel, device drivers,
>application libraries and executables needed to run an application or 
>service.  But it's even simpler than that, since the node doesn't even 
>have initialization and configuration files.
>
>>down. You should know as well as I do that your users are scientists and
>>academics, they are not security professionals. They'll pick bad
>>passwords, log in from Winblows terminals which have been infected with
>>viruses, keyloggers, etc. In short, you lower the security of the system
>>to that of your least secure user.
>>
>>At least if your systems are updated, the chance of an attack escalating
>>privilege and making the situation serious is small. You can give Mr.
>>Sloppy The Lecture, restore his files from backup, and be on your merry way.
>
>Or you can have compute nodes with only the application running.  No 
>daemons are running to break into, and there are no local 
>system configuration files in which to hide persistent changes, e.g. cracks 
>that start up after a reboot.
>
>You still have to keep masters updated, but they are few (just enough for 
>redundancy) and in an internet server environment they don't need to be 
>exposed to the outside world ("firewalled" by the single-purpose slaves 
>with zero installations).
>
>
>A final note: The magic of a dynamic "slave node" architecture isn't a 
>characteristic of the mechanism used to start or control processes.  It 
>does interact with the process subsystem -- it needs to know what to cache 
>(including the specific version) to start up processes accurately.  But the 
>other details, such as the library call interface, process control, 
>security mechanism, naming and cluster membership are almost unrelated.
>
>Nor does "ramdisk root" give you the magic.  A ramdisk root is part of how 
>we implement the architecture, especially the part about not requiring local 
>storage or network file systems to work.  (Philosophy: You mount file 
>systems as needed for application data, not for the underlying system.)  
>But to be fast, efficient and effective the ramdisk can't just be a 
>stripped-down full distribution.  You need a small 'init' system and 
>a dynamic, version-based caching mechanism.  Otherwise you end up with 
>lots of wasted memory and version skew, and you still have a crippled 
>compute node environment.
>
Last time I used Scyld, it involved kickstart floppies, RARP bootup, and
a patch to apply against MPICH for recompilation with the PGI compilers.
They don't make 'em like they used to, thank God.

I've always thought of Scyld as more like one single appliance than a
cluster. If you're updating the master node, you're sort of updating
everything. If the kernel being used on the slaves turns out to be
exploitable, update the master and reboot the nodes. Problem solved.
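
To make the "update the master, reboot the nodes" point concrete, here is a
minimal sketch of a version-keyed node cache of the kind Donald describes.
It is not Scyld's implementation: the NodeCache class, the fetch callback
(standing in for a transfer from the master), and the library names and
version strings are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical cache key: (object name, version string).
ObjectKey = Tuple[str, str]


@dataclass
class NodeCache:
    """Illustrative ramdisk cache on a slave node, keyed by (name, version)."""

    # Assumption: fetch() stands in for copying one object from the master.
    fetch: Callable[[ObjectKey], bytes]
    objects: Dict[ObjectKey, bytes] = field(default_factory=dict)

    def prepare(self, manifest: Iterable[ObjectKey]) -> List[ObjectKey]:
        """Cache every object the process needs; return only what was fetched."""
        fetched = []
        for key in manifest:
            if key not in self.objects:
                self.objects[key] = self.fetch(key)
                fetched.append(key)
        return fetched

    def reboot(self) -> None:
        """The cache lives in RAM, so a reboot leaves nothing behind."""
        self.objects.clear()


def fake_master_fetch(key: ObjectKey) -> bytes:
    """Placeholder for the real transfer; returns dummy bytes."""
    name, version = key
    return f"{name}@{version}".encode()


# Two jobs that share libraries but use different versions of one executable.
job_a = [("libc.so.6", "2.3.5"), ("libmpich.so", "1.2.7"), ("solver", "1.0")]
job_b = [("libc.so.6", "2.3.5"), ("libmpich.so", "1.2.7"), ("solver", "1.1")]

cache = NodeCache(fetch=fake_master_fetch)
first = cache.prepare(job_a)   # cold cache: all three objects are fetched
second = cache.prepare(job_b)  # shared libraries are reused; only the new solver is fetched
```

Because objects are keyed by version, an updated library on the master is
simply a new key, and a rebooted node starts from an empty cache, so no
stale or tampered copy survives on the slave.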

-- 
Geoffrey D. Jacobs
MORE CORE AVAILABLE, BUT NONE FOR YOU.