Linux magic wand - was Re: [Beowulf] Re: "hobbyists"

Tue Jun 24 18:11:06 PDT 2008

Mark Hahn wrote:
>
> so the question is, if you had a magic wand, what would you change in 
> the kernel (or perhaps libc or other support libs, etc)?  most of the 
> things I can think of are not clear-cut.  I'd like to be able to give 
> better info from perf counters to our users (but I don't think Linux 
> is really in the way).  I suspect we lose some performance due to jitter
> injected by the OS (and/or our own monitoring) and would like to improve,
> but again, it's hard to blame Linux.  I'd love to have better options 
> for cluster-aware filesystems.  kernel-assisted network shared memory?
There's a good rant to be written for Usenix or the Ottawa Linux
Symposium, I suspect.

VM - a 4096-byte page is small now.  In 1976 a page was 512 bytes.  It
moved to 4096 in the mid '90s?  I forget.  Since then machines have
gotten much bigger and memory bandwidth much faster.  The telling point
for me was that I took a look at a running system and there were only a
couple of hundred VM areas in service, so page breakage (the unused
tail of the last page in each area) amounts to almost nothing.  We run
with 64K pages and plan to experiment with much larger ones.
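
A rough sketch of that arithmetic, assuming each VM area wastes half a
page on average in its final page (an assumption for illustration, not
a measurement).  The function name and the figure of 200 areas are
made up for the example:

```python
def expected_breakage_bytes(vm_areas: int, page_size: int) -> int:
    """Estimate page breakage, assuming each VM area wastes
    half a page on average in its last page."""
    return vm_areas * page_size // 2


# Compare 4K and 64K pages for a hypothetical 200 VM areas.
for page_size in (4096, 65536):
    waste = expected_breakage_bytes(200, page_size)
    print(f"{page_size:>6}-byte pages: ~{waste / 1024:.0f} KiB lost")
```

Under that assumption the loss is about 400 KiB with 4K pages and about
6400 KiB with 64K pages, which is small next to the memory on a
typical node.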

One could argue about thread stacks, but I think that threads and HPC 
don't mix well, so there won't be that
many.  I am aware of the great debate about the right way to program 
high core-count nodes, but I
doubt that more threads than processors is the right answer.

Linux also has pretty poor mechanisms for keeping physical memory
contiguous; the slabs tend to get fragmented, which is why the big page
stuff and things like bigphysarea get preallocated.  There's no good
reason why you couldn't compact memory on the fly.
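
To make the compaction idea concrete, here is a toy model only, not
how any kernel does it.  It assumes every in-use frame is movable
(pinned or kernel-owned pages would break that assumption), and the
names `largest_free_run` and `compact` are hypothetical:

```python
def largest_free_run(frames):
    """Length of the longest run of free (None) frames."""
    best = run = 0
    for owner in frames:
        run = run + 1 if owner is None else 0
        best = max(best, run)
    return best


def compact(frames):
    """Slide every in-use frame toward the low end, preserving order.

    Returns the new layout plus a remap {old_index: new_index}, which
    stands in for the page-table fixups a real system would need.
    """
    new_frames = [None] * len(frames)
    remap = {}
    dst = 0
    for src, owner in enumerate(frames):
        if owner is not None:
            new_frames[dst] = owner
            remap[src] = dst
            dst += 1
    return new_frames, remap


# Toy physical memory: letters are in-use frames, None is free.
frames = ["a", None, "b", None, None, "c", None, "d"]
before = largest_free_run(frames)
compacted, remap = compact(frames)
after = largest_free_run(compacted)
print(before, after)
```

In this toy layout the longest contiguous free run grows from 2 frames
to 4 after compaction; the hard parts in practice are the unmovable
pages and the cost of the remapping, which the sketch ignores.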

The VM system is also in the way of OS-bypass RDMA NICs - you either
get large kernel patches like Quadrics to let virtual RDMA work, or you
get pinning and registration and other performance-sapping cruft.  The
new external-pager stuff may help a lot here; I haven't looked at it
yet.

I/O system

The block device layer has 512-byte sectors wired in, and is solely
useful for devices that you own exclusively.  You've got queueing going
on at multiple levels, I think because the architecture has assumptions
about CPU/disk performance ratios baked in.  And the segments of a bio
have to complete in order; what's that about?  A little one we ran into
here is that the block I/O system doesn't know if an I/O is to satisfy
an I-stream page fault or a D-stream page fault.  Consequently, if your
L1 Icache is not coherent (and few are) you have to flush it on all
read completions.  A little bookkeeping would solve that.  (I hope I am
wrong about this one!)

File systems

Agree completely about cluster-aware FS.  We struggle with the Lustre
patch sets, which may be an
extreme case.

Performance stuff

We are big users of the PAPI infrastructure, which is pretty good, but 
once you step off that train you
have to deal with things like sysfs.  So we're trying to read hardware 
counters without undue disturbance
to running HPC applications, and the advice of Linux is to make a system
call for each value, converted to ASCII.  This makes sense for slow
admin stuff but not for performance data.  At least it isn't XML.
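
For illustration, this is roughly what reading one sysfs-style value
costs.  It is a sketch under assumptions: a temporary file stands in
for a real sysfs attribute (real paths are device-specific), and
`read_ascii_counter` is a hypothetical helper, not part of PAPI or any
kernel interface:

```python
import os
import tempfile


def read_ascii_counter(path: str) -> int:
    """Read one sysfs-style value: open, read, close, then parse ASCII."""
    fd = os.open(path, os.O_RDONLY)   # one system call
    try:
        raw = os.read(fd, 64)         # another system call
    finally:
        os.close(fd)                  # and a third
    # The value arrives as decimal text and has to be parsed back.
    return int(raw.decode("ascii").strip())


# Stand-in for a sysfs attribute file holding a single counter value.
with tempfile.NamedTemporaryFile("w", suffix=".counter", delete=False) as f:
    f.write("123456\n")
    demo_path = f.name

value = read_ascii_counter(demo_path)
os.unlink(demo_path)
print(value)
```

Multiply that by every counter on every core at every sampling
interval and the disturbance adds up, even if the descriptor is kept
open so that only the read is repeated.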

Runtime system

I tend towards thinking we would be better off without shared 
libraries.  Memory is big, programs are
generally small.  There is a lot of complexity here, to which I am 
allergic.  To the extent that shared
libraries make the program slower (due to separate segments for library 
data, for example), let's
get rid of them.  Two arguments in favor are when the library is 
implementing a system service
chosen by the admin, rather than the programmer (PAM modules), and there 
is this talk about
MPI ABIs, so applications can use alternate packages without relinking.  
I think that is a bad idea
too, but it is off-topic.

OS noise

This becomes a big issue in large systems.  There's way too much stuff 
running in Linux, each piece
separately designed, each thread with its own notions of timing and 
periodic wakeups.  Maybe the OS
should run on a separate node altogether, and you communicate with it 
via RDMA.   All that is
left behind is maybe memory management.
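
One common way to get a feel for this kind of noise is to time a fixed
quantum of work over and over and look at the spread.  A minimal
sketch, with made-up sample counts; in Python the interpreter adds its
own jitter, so treat the numbers as indicative only:

```python
import time


def sample_quantum_times(samples: int = 200, work: int = 10000):
    """Time the same fixed amount of busy work repeatedly.

    Returns a list of per-sample durations in seconds.  On a quiet
    system they should be nearly identical; outliers suggest the
    process was interrupted or descheduled.
    """
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        x = 0
        for i in range(work):
            x += i
        durations.append(time.perf_counter() - start)
    return durations


d = sample_quantum_times()
print(f"min {min(d) * 1e6:.0f} us, max {max(d) * 1e6:.0f} us, "
      f"max/min {max(d) / min(d):.2f}")
```

On a large machine the slowest sample on any node tends to set the pace
for everyone at the next barrier, which is why the tail matters more
than the mean.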

-L