Linux magic wand - was Re: [Beowulf] Re: "hobbyists"
Lawrence Stewart
larry.stewart at sicortex.com
Tue Jun 24 18:11:06 PDT 2008
Mark Hahn wrote:
>
> so the question is, if you had a magic wand, what would you change in
> the kernel (or perhaps libc or other support libs, etc)? most of the
> things I can think of are not clear-cut. I'd like to be able to give
> better info from perf counters to our users (but I don't think Linux
> is really in the way). I suspect we lose some performance due to jitter
> injected by the OS (and/or our own monitoring) and would like to improve,
> but again, it's hard to blame Linux. I'd love to have better options
> for cluster-aware filesystems. kernel-assisted network shared memory?
There's a good rant to be written for Usenix or the Ottawa Linux
Symposium, I suspect.
VM - 4096 is small now. In 1976 a page was 512 bytes. It moved to 4096
in the mid '90s? I forget.
Since then computers and memory bandwidths are much bigger and faster.
The telling point for me was that I took a look at a running system and
there were only a couple of <hundred> VM areas in service, so page
breakage (the space lost rounding each area up to a whole page) amounts
to almost nothing. We run with 64K pages and plan to experiment with
much larger ones.
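
As a rough worked example of the page-breakage argument, here is a small
sketch. It assumes each VM area wastes half a page on average (area
lengths uniform modulo the page size), takes 200 as "a couple of
hundred" areas, and uses 2 MiB as a stand-in for "much larger" pages.
All three are illustrative assumptions, not measurements.

```python
def expected_page_waste(n_areas, page_size):
    """Expected bytes lost rounding each VM area up to a whole page.

    Assumes area lengths are uniform modulo the page size, so each area
    wastes half a page on average (an assumption, not a measurement).
    """
    return n_areas * page_size // 2


KIB = 1024
MIB = KIB * KIB

# 200 areas stands in for "a couple of hundred"; 2 MiB is a hypothetical
# "much larger" page size chosen only for illustration.
for page_size in (4 * KIB, 64 * KIB, 2 * MIB):
    waste = expected_page_waste(200, page_size)
    print(f"{page_size // KIB} KiB pages: about {waste / MIB:.2f} MiB wasted")
```

Under those assumptions 64K pages cost roughly 6 MiB per process and
2 MiB pages cost 200 MiB, so the argument holds comfortably at 64K and
gets less comfortable as pages grow.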
One could argue about thread stacks, but I think that threads and HPC
don't mix well, so there won't be that
many. I am aware of the great debate about the right way to program
high core-count nodes, but I
doubt that more threads than processors is the right answer.
Linux also has pretty poor mechanisms for keeping physical memory
contiguous; the slabs tend to get fragmented, which is why the big page
stuff and things like bigphysarea get preallocated. There's no good
reason why you couldn't compact memory on the fly.
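
To make "compact memory on the fly" concrete, here is a toy sketch. It
is not how any kernel does it: it models physical memory as a list of
frames, assumes every used page is movable (pinned or RDMA-registered
pages would not be), and returns the old-to-new frame moves that a real
system would have to apply to its page tables. The function names are
made up for illustration.

```python
def largest_free_run(frames):
    """Length of the longest run of consecutive free (None) frames."""
    best = run = 0
    for page in frames:
        run = run + 1 if page is None else 0
        best = max(best, run)
    return best


def compact(frames):
    """Move used pages from the high end into free frames at the low end.

    frames: list where None is a free frame, anything else is a page id.
    Returns (new_frames, moves); moves maps old frame index -> new index.
    Assumes every used page may be moved (a simplifying assumption).
    """
    frames = list(frames)
    moves = {}
    lo, hi = 0, len(frames) - 1
    while True:
        while lo < hi and frames[lo] is not None:
            lo += 1          # find the lowest free frame
        while lo < hi and frames[hi] is None:
            hi -= 1          # find the highest used frame
        if lo >= hi:
            break
        frames[lo], frames[hi] = frames[hi], None
        moves[hi] = lo       # page tables would need this remapping
    return frames, moves


before = ["a", None, "b", None, None, "c", None, "d"]
after, moves = compact(before)
```

In this example the longest free run goes from 2 frames to 4 at the cost
of two page moves, which is the trade a compacting kernel would be
making.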
The VM system is also in the way of OS-bypass RDMA NICs - you either get
large kernel patches like Quadrics to let virtual RDMA work, or you get
pinning and registration and other performance-sapping cruft. The new
external-pager stuff may help a lot here; I haven't looked at it yet.
I/O system
The block device layer has 512-byte sectors wired in, and is solely
useful for devices that you own exclusively. You've got queueing going
on at multiple levels, I think because the architecture has assumptions
about CPU/disk performance ratios baked in. And the segments of a bio
have to complete in order; what's that about? A little one we ran into
here is that the block I/O system doesn't know if an I/O is to satisfy
an I-stream page fault or a D-stream page fault. Consequently, if your
L1 Icache is not coherent (and few are), you have to flush it on all
read completions. A little bookkeeping would solve that. (I hope I am
wrong about this one!)
File systems
Agree completely about cluster-aware FS. We struggle with the Lustre
patch sets, which may be an
extreme case.
Performance stuff
We are big users of the PAPI infrastructure, which is pretty good, but
once you step off that train you
have to deal with things like sysfs. So we're trying to read hardware
counters without undue disturbance
to running HPC applications, and the advice of Linux is to make a system
call for each value, converted to ASCII. This makes sense for slow admin
stuff but not for performance data. At least it isn't XML.
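
A small sketch of the difference, using ordinary temporary files as
stand-ins. The "one ASCII value per file" layout mimics the sysfs
convention; the packed binary file is a hypothetical alternative, not a
real kernel interface, and the function names are invented for this
example.

```python
import os
import struct
import tempfile


def read_ascii_per_file(paths):
    """sysfs style: open, read, close and an ASCII parse for every value."""
    values = []
    for path in paths:
        with open(path, "rb") as f:
            values.append(int(f.read().strip()))
    return values


def read_packed(path, count):
    """Hypothetical alternative: one read of `count` packed 64-bit counters."""
    with open(path, "rb") as f:
        data = f.read(8 * count)
    return list(struct.unpack(f"<{count}Q", data))


def demo(values=(12345, 67, 8901234)):
    """Write the same counters both ways and read them back."""
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i, v in enumerate(values):
            path = os.path.join(d, f"counter{i}")
            with open(path, "w") as f:
                f.write(f"{v}\n")          # one ASCII value per file
            paths.append(path)
        packed = os.path.join(d, "counters.bin")
        with open(packed, "wb") as f:
            f.write(struct.pack(f"<{len(values)}Q", *values))
        return read_ascii_per_file(paths), read_packed(packed, len(values))
```

Both paths return the same numbers; the first costs several system calls
and a text conversion per counter, while the second costs one read for
the whole set, which is what matters when sampling next to a running HPC
job.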
Runtime system
I tend towards thinking we would be better off without shared
libraries. Memory is big, programs are
generally small. There is a lot of complexity here, to which I am
allergic. To the extent that shared
libraries make the program slower (due to separate segments for library
data, for example), let's
get rid of them. Two arguments in favor are when the library is
implementing a system service
chosen by the admin, rather than the programmer (PAM modules), and there
is this talk about
MPI ABIs, so applications can use alternate packages without relinking.
I think that is a bad idea
too, but it is off-topic.
OS noise
This becomes a big issue in large systems. There's way too much stuff
running in Linux, each piece
separately designed, each thread with its own notions of timing and
periodic wakeups. Maybe the OS
should run on a separate node altogether, and you communicate with it
via RDMA. All that is
left behind is maybe memory management.
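
A back-of-the-envelope sketch of why noise gets worse with scale. It
assumes a bulk-synchronous application that waits for every node at each
step, and that each node is independently hit by some OS activity during
a step with probability p. Both the independence assumption and the
numbers are illustrative only.

```python
def p_step_delayed(p_node, n_nodes):
    """Probability that at least one of n_nodes is interrupted in a step.

    Assumes interruptions are independent across nodes, each with
    probability p_node per step (an illustrative model, not a measurement).
    """
    return 1.0 - (1.0 - p_node) ** n_nodes


# A 0.1% chance per node per step is negligible on one node, but the
# slowest node sets the pace for everyone at a barrier.
for n in (1, 100, 1000, 10000):
    print(f"{n} nodes: P(step delayed) = {p_step_delayed(0.001, n):.3f}")
```

With p = 0.001, a single node is delayed one step in a thousand, while
at 1000 nodes roughly 63% of steps have at least one straggler.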
-L