[Beowulf] Computation on the head node

Perry E. Metzger perry at piermont.com
Mon May 19 09:44:42 PDT 2008


"Jeffrey B. Layton" <laytonjb at charter.net> writes:
>> Third, you could be doing lots of file i/o to legitimate data
>> files. Here again, it is possible that if the files are small enough
>> and your access patterns are repetitive enough that increasing your
>> RAM could be enough to make everything fit in the buffer cache and
>> radically lower the i/o bandwidth. On the other hand, if you're
>> dealing with files that are tens or hundreds of gigabytes instead of
>> tens of megabytes in size, and your access patterns are very
>> scattered, that clearly isn't going to help and at that point you need
>> to improve your I/O bandwidth substantially.
>
> It's never this simple - never :)

Sometimes it is this simple. Indeed, often it is.

> Plus, different file systems will impact the IO performance in
> different ways.

Well, of course.

> It's never as simple, as "add more memory" or "need more bandwidth".

Sometimes it *is* as simple as "add more memory". I remember one
particular problem I dealt with once where adding about 30% more
memory for file cache nearly eliminated disk i/o, at which point it
was no longer necessary to optimize the i/o subsystem.

If you don't believe that's ever happened, well, fine by me. It won't
hurt me either way. :)

> You need to understand your IO pattern and what the code is doing.

Naturally, but sometimes the solution is as easy as "add more
memory". The best possible way to improve i/o performance is to
eliminate the i/o entirely if you can. If you're just spewing data out
really, really fast, memory won't help. If you're reading and writing
the same data, or you're reading a reasonably sized working set,
memory helps.
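To make that concrete, here is the sort of back-of-the-envelope check
I mean: add up the data files the job actually touches and compare
against physical RAM. It's only a rough sketch (the data directory
path is made up, and it assumes Linux's /proc/meminfo), but it tells
you quickly whether "just add memory" is even in the running:

    #!/usr/bin/env python
    # Rough check: does the data the job touches fit in RAM?
    # (Illustrative sketch only -- the data directory is hypothetical.)
    import os

    DATA_DIR = "/scratch/myjob/data"   # hypothetical job data location

    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(DATA_DIR):
        for name in filenames:
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file vanished mid-walk; ignore it

    # MemTotal in /proc/meminfo is reported in kB on Linux.
    mem_total_bytes = 0
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                mem_total_bytes = int(line.split()[1]) * 1024
                break

    print("working set:  %.1f GB" % (total_bytes / 1e9))
    print("physical RAM: %.1f GB" % (mem_total_bytes / 1e9))
    if total_bytes < 0.5 * mem_total_bytes:
        print("Likely cacheable: the buffer cache can absorb most reads.")
    else:
        print("Working set is a large fraction of RAM: cache alone won't save you.")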

>>> The best way I've found is to look at the IO pattern of your
>>> code(s). The best way I've found to do this is to run an strace against
>>> the code. I've written an strace analyzer that gives you a
>>> higher-level view of what's going on with the IO.
>>
>> That will certainly give you some idea of access patterns for case 3
>> (above), but on the other hand, I've gotten pretty far just glancing
>> at the code in question and looking at the size of my files.
>
> But what if you don't have access to the source, or can't share the
> source with vendors (of the data set)?

Very often you can figure out what the code is doing just by looking
at things like page hit rates from the various status
programs. They'll tell you what your I/O pattern is like.
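For instance, a few lines of script sampling the kernel counters while
the job runs will tell you how much traffic actually reaches the disks
versus being absorbed by the page cache. Just a sketch of the idea
(Linux-specific; vmstat, iostat and friends report the same counters
more conveniently):

    #!/usr/bin/env python
    # Crude i/o-pattern peek without touching the application's source:
    # sample /proc/vmstat and /proc/meminfo while the job runs and watch
    # how much traffic hits the block devices vs. staying in page cache.
    import time

    def read_counters():
        vals = {}
        with open("/proc/vmstat") as f:
            for line in f:
                key, value = line.split()
                if key in ("pgpgin", "pgpgout"):  # KiB paged in/out of block devices
                    vals[key] = int(value)
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("Cached:"):
                    vals["cached_kb"] = int(line.split()[1])
        return vals

    prev = read_counters()
    while True:
        time.sleep(5)
        cur = read_counters()
        print("disk read %6d KiB/s  disk write %6d KiB/s  page cache %8d MiB"
              % ((cur["pgpgin"] - prev["pgpgin"]) / 5,
                 (cur["pgpgout"] - prev["pgpgout"]) / 5,
                 cur["cached_kb"] / 1024))
        prev = cur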

The vendor sharing issue is, of course, far more
complicated. Everything that involves people and not machines is
pretty much by definition more complicated. :)

>> I have to say, though, that really dumb measures (like increasing the
>> amount of RAM available for buffer cache -- gigs of memory are often a
>> lot cheaper than a fiber channel card -- or just having a fast and
>> cheap local drive for intermediate data i/o) can in many cases make
>> the problem go away a lot better than complicated storage back end
>> hardware can.
>
> IMHO and in my experience, many times just adding memory can't make
> the problem go away.

If you don't know how to tune the file cache usage, that's certainly
true -- without tuning, you'll never use the extra RAM. I've seen
people who have added more memory and then said "well, see, that did
no good" but they didn't know how to tune their OS correctly for their
job load so of course it wasn't going to do them any good. (There are
also systems where you just can't tune for what you want -- try tuning
a Windows 2000 Server box to use more file cache, for example, and
you'll spend your time tearing your hair out.)
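As a concrete starting point, these are the sort of Linux VM knobs I
mean; a quick look at them (sketch below, Linux-only, and the right
values depend entirely on your job load) at least tells you what the
box is currently doing with its cache and writeback:

    #!/usr/bin/env python
    # Peek at the Linux VM knobs that govern how the page cache is used
    # and flushed. (A sketch only; other OSes expose different or no knobs.)
    TUNABLES = [
        "swappiness",              # willingness to evict anonymous pages vs. cache
        "vfs_cache_pressure",      # how eagerly dentry/inode caches are reclaimed
        "dirty_background_ratio",  # % of RAM dirty before background writeback starts
        "dirty_ratio",             # % of RAM dirty before writers are throttled
    ]

    for name in TUNABLES:
        try:
            with open("/proc/sys/vm/" + name) as f:
                print("vm.%s = %s" % (name, f.read().strip()))
        except IOError:
            print("vm.%s not available on this kernel" % name)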

I've found that, remarkably often, more memory *can* make the problem
go away, but only in cases where keeping most files hot in cache can
eliminate the i/o entirely. If you're writing out tens or hundreds of
gigs of generated data, memory alone is clearly not going to fix the
problem. If you are reading primarily, but your working set is much
larger than the amount of memory you could possibly afford, then more
memory is clearly not going to help. However, if you are hitting a hot
half gig or gig of file data for read and write, memory makes all the
difference in the world.

>> If you really are hitting disk and can't help it, a drive on every
>> node means a lot of spindles and independent heads, versus a fairly
>> small number of those at a central storage point. 200 spindles always
>> beat 20.
>
> What if you need to share data across the nodes?

Here again, it depends. If everyone's hitting a bunch of hot data in
one spot, clearly you're going to lose (though even then, maybe having
a ridiculously large ramdisk from a commercial supplier can help). If,
on the other hand, data access patterns are fairly randomly spread,
then it might be a big win to spread the data across the nodes, just
as one can win big by slicing up database table rows across lots of
servers in some applications. "It depends on what you are doing."
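The slicing idea itself is nothing exotic. Here's a toy sketch, with
invented node names and key formats, of hashing each record to the
node that owns it, so every node serves its share of the accesses from
its own local spindles:

    #!/usr/bin/env python
    # Toy illustration of slicing data across nodes: a stable hash of the
    # record key picks the owning node, giving each node roughly 1/N of
    # the traffic. (Node names and keys are invented for the example.)
    import hashlib

    NODES = ["node%02d" % i for i in range(16)]   # hypothetical 16-node cluster

    def owner(key):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    for key in ("sample_000123", "sample_000124", "run42/frame_0007"):
        print("%-20s -> %s" % (key, owner(key)))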

> Having data spread out all over creation doesn't help.

It can. Look at how Google does things. They spread their data out
"all over creation", and they win big. They don't use giant file
servers at all -- they spread disk i/o out to hundreds of thousands of
spindles on hundreds of thousands of nodes.

> In addition, I like to get drives out of the nodes if I can to help
> increase MTTI.

That's also a consideration. As always, it is a tradeoff, and
understanding the particular application the cluster is being used for
is key to knowing what the right thing is.

>> In any case, let me note the most important rule: if your CPUs aren't
>> doing work most of the time, you're not allocating resources
>> properly. If the task is really I/O bound, there is no point in having
>> more CPU than I/O can possibly manage. You're better off having half
>> the number of nodes with gargantuan amounts of cache memory than
>> having CPUs that are spending 80% of their time twiddling their
>> thumbs. The goal is to have the CPUs crunching 100% of the time, and
>> if they're not doing that, you're not doing things as well as you can.
>
> I absolutely disagree.

To each his own. I was under the impression that in scientific
computing the name of the game was having your computation done as
fast as possible at the lowest possible cost.

If your CPU is idle, why did you pay for it? There's a huge cost
differential these days between fast and slow CPUs. Why didn't you buy
a much cheaper CPU that would remain nearly 100% busy while keeping
the I/O subsystem just as fast? You would have saved lots of cash, your
job would be done just as fast, and probably (in a modern system) you
would have saved a whole lot of electricity, because slower CPUs eat
fewer watts.
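To put some (entirely invented) numbers on it, here's a toy comparison
of a fast part sitting 80% idle against a cheaper part kept busy. The
prices, wattages, speeds and utilization figures below are
hypothetical; plug in your own:

    #!/usr/bin/env python
    # Back-of-the-envelope: a fast CPU idling on i/o vs. a cheaper, slower
    # CPU kept busy. All figures below are invented purely for illustration.
    fast = {"price": 1200.0, "watts": 120.0, "utilization": 0.20}  # hypothetical
    slow = {"price":  400.0, "watts":  65.0, "utilization": 0.95}  # hypothetical

    def useful_work_per_dollar(cpu, relative_speed):
        # "useful work" ~ raw speed * fraction of time actually computing
        return relative_speed * cpu["utilization"] / cpu["price"]

    # Suppose the fast part is 2x the slow part on raw compute.
    print("fast CPU: %.6f work/$, %5.1f W" % (useful_work_per_dollar(fast, 2.0), fast["watts"]))
    print("slow CPU: %.6f work/$, %5.1f W" % (useful_work_per_dollar(slow, 1.0), slow["watts"]))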

> I can name many examples where the code has to do a fair amount of
> IO as part of the computation so you have to write data.

Sure, but the name of the game is to wait for I/O as little as
possible. Every moment the CPU is idle, it could be doing something
else instead. A precious resource is being wasted.

> Doing this in an efficient manner is pretty damn important.

I believe that's more or less what I've said, yes?

> Understanding the IO pattern of your code to help you choose the
> underlying hardware and file system is absolutely critical.

I can't say that I disagree there; however, the reason for that is so
that your CPU can spend more of its time working and less of it
twiddling its virtual thumbs waiting for data to work on.

> I can also think of examples where you can't stuff enough memory
> in a box so you will have to consider IO as part of the computation.

I fully agree. It is trivial to think of examples where you have to
flush out so much data that it is simply impossible to alleviate the
problem with RAM. If your working set is 100G of data, you're not
going to fix that with RAM on individual nodes. However, when you can
fix things with RAM, it is a wonderfully simple and elegant solution.

> I believe you're thinking of local IO - like a desktop.

No, really, I'm not.

>>> I'm also working on a tool that can take the strace output and
>>> create a "simulator" that will run in a similar manner to the
>>> original code but actually perform the IO of the original code using
>>> dummy data. This allows you to "give" away a simple dummy code to
>>> various HPC storage vendors and test your application.  This code is
>>> taking a little longer than I'd hoped to develop :(
>>
>> It sounds cool, but I suspect that with even simpler tools you can
>> probably deduce most of what is going on and get around it.
>
> If you know a better way - let's hear it!

Well, as always, it depends on your particular cluster issue, and I'm
not privy to your job load. :)
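But to give one example of what I mean by "simpler tools": a few lines
that tally read()/write() traffic per file descriptor out of an strace
log (e.g. strace -f -e trace=read,write -o trace.log ./your_app). It's
only a sketch, strace output formats vary between versions, and it
obviously does far less than the analyzer you describe:

    #!/usr/bin/env python
    # Summarize read()/write() traffic per file descriptor from an strace
    # log. Usage: summarize.py trace.log
    # (Sketch only; lines strace couldn't parse or that failed are skipped.)
    import re
    import sys
    from collections import defaultdict

    pattern = re.compile(r'(read|write)\((\d+),.*\)\s*=\s*(\d+)')

    totals = defaultdict(lambda: {"read": 0, "write": 0, "calls": 0})
    with open(sys.argv[1]) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                syscall, fd, nbytes = m.group(1), m.group(2), int(m.group(3))
                totals[fd][syscall] += nbytes
                totals[fd]["calls"] += 1

    for fd, t in sorted(totals.items(), key=lambda kv: int(kv[0])):
        print("fd %3s: %10d bytes read, %10d bytes written, %6d calls"
              % (fd, t["read"], t["write"], t["calls"]))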

> I haven't seen one yet and having worked for an HPC storage company
> I haven't seen one from them either.  I'm always looking for better
> techniques but I have to tell you, I'm really skeptical of your ideas.

You needn't listen to me, then. It is a free country, and I won't be
insulted in the least if you ignore me. I'm pretty laid back about
that sort of thing. :)


Perry


