[Beowulf] Computation on the head node
Jeffrey B. Layton
laytonjb at charter.net
Mon May 19 07:14:48 PDT 2008
Perry E. Metzger wrote:
> "Jeffrey B. Layton" <laytonjb at charter.net> writes:
>> Here comes the $64 question - how do you benchmark the IO portion of
>> your code so you can understand whether you need a parallel file
>> system, what kind of connection do you need from a client to the
>> storage, etc. This is a difficult problem and one in which I have an
> This is straightforward, though not easy to explain compactly. The
> key is to know how to run tools like top, vmstat, etc. and read
Sorry - perhaps I wasn't clear enough. I'm looking at just the
IO portion of the code - not the overall picture (although the overall
picture is probably more important). I'm assuming you know your
code to some degree and realize that you need to understand the impact
of IO performance on the overall performance.
> Third, you could be doing lots of file i/o to legitimate data
> files. Here again, it is possible that if the files are small enough
> and your access patterns are repetitive enough that increasing your
> RAM could be enough to make everything fit in the buffer cache and
> radically lower the i/o bandwidth. On the other hand, if you're
> dealing with files that are tens or hundreds of gigabytes instead of
> tens of megabytes in size, and your access patterns are very
> scattered, that clearly isn't going to help and at that point you need
> to improve your I/O bandwidth substantially.
It's never this simple - never :) Plus, different file systems will impact
the IO performance in different ways. It's never as simple as "add more
memory" or "need more bandwidth". You need to understand your IO
pattern and what the code is doing. In addition, this can be a function of
the problem size and of the number of processes and their layout.
>> The best way I've found is to look at the IO pattern of your
>> code(s). The best way I've found to do this is to run an strace against
>> the code. I've written an strace analyzer that gives you a
>> higher-level view of what's going on with the IO.
> That will certainly give you some idea of access patterns for case 3
> (above), but on the other hand, I've gotten pretty far just glancing
> at the code in question and looking at the size of my files.
But what if you don't have access to the source, or can't share the source
(or the data set) with vendors?
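For what it's worth, here's a toy sketch of the kind of strace post-processing I'm talking about. The log lines below are fabricated for illustration, and the regex only handles the simplest strace output shape (real output varies with strace version and flags like -f and -T):

```python
import re
from collections import defaultdict

# Match simplified strace lines such as:
#   read(3, "data", 65536) = 65536
# as produced (roughly) by: strace -e trace=read,write,lseek -o app.log ./app
SYSCALL_RE = re.compile(r'^(?:\d+\s+)?(\w+)\((\d+)[^)]*\)\s+=\s+(-?\d+)')

def summarize(lines):
    """Return per-syscall call counts and total bytes moved by read/write."""
    calls = defaultdict(int)
    bytes_moved = defaultdict(int)
    for line in lines:
        m = SYSCALL_RE.match(line)
        if not m:
            continue
        name, ret = m.group(1), int(m.group(3))
        calls[name] += 1
        if name in ("read", "write") and ret > 0:
            bytes_moved[name] += ret
    return dict(calls), dict(bytes_moved)

# Fabricated log for illustration only:
log = [
    'read(3, "data", 65536) = 65536',
    'read(3, "data", 65536) = 1024',
    'write(4, "out", 4096) = 4096',
    'lseek(3, 0, SEEK_SET) = 0',
]
calls, moved = summarize(log)
```

Even a crude summary like this (how many calls, how big, read-heavy vs write-heavy) tells you a lot about whether a code is doing many small IOs or a few large streaming ones, which is exactly what drives the file system choice.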
> I have to say, though, that really dumb measures (like increasing the
> amount of RAM available for buffer cache -- gigs of memory are often a
> lot cheaper than a fiber channel card -- or just having a fast and
> cheap local drive for intermediate data i/o) can in many cases make
> the problem go away a lot better than complicated storage back end
> hardware can.
IMHO and in my experience, many times just adding memory can't make things
go away. Parallel file systems may or may not be able to use the buffer
cache on a single node - it really depends. The IO pattern also has a huge
impact on whether the cache helps much or not. In addition, if users
know that a node has 8 GB of memory but you set up queues to only
allow 4 GB, I will be willing to bet they find a way to use 8 GB :)
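If you do want to see whether the buffer cache is even in play while a job runs, one cheap Linux-specific check is the "Cached:" line in /proc/meminfo. A minimal sketch (the sample text here is made up; on a real node you'd read the actual file):

```python
def cached_bytes(meminfo_text):
    """Extract the page-cache size (the "Cached:" line, reported in kB)
    from the text of /proc/meminfo, returned in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("Cached:"):
            return int(line.split()[1]) * 1024
    return None

# On a live Linux node you would do:
#   with open("/proc/meminfo") as f:
#       print(cached_bytes(f.read()))
sample = "MemTotal:  8388608 kB\nCached:  2097152 kB\n"
```

Watching that number while the code runs at least tells you whether the kernel is caching your files at all - though as I said above, with a parallel file system the client-side cache may not behave the way you expect.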
> If you really are hitting disk and can't help it, a drive on every
> node means a lot of spindles and independent heads, versus a fairly
> small number of those at a central storage point. 200 spindles always
> beat 20.
What if you need to share data across the nodes? Having data spread out
all over creation doesn't help. In addition, I like to get drives out of
the nodes if I can to help increase MTTI.
> In any case, let me note the most important rule: if your CPUs aren't
> doing work most of the time, you're not allocating resources
> properly. If the task is really I/O bound, there is no point in having
> more CPU than I/O can possibly manage. You're better off having half
> the number of nodes with gargantuan amounts of cache memory than
> having CPUs that are spending 80% of their time twiddling their
> thumbs. The goal is to have the CPUs crunching 100% of the time, and
> if they're not doing that, you're not doing things as well as you can.
I absolutely disagree. I can name many examples where the code has
to do a fair amount of IO as part of the computation so you have to
write data. Doing this in an efficient manner is pretty damn important.
Understanding the IO pattern of your code to help you choose the
underlying hardware and file system is absolutely critical.
I can also think of examples where you can't stuff enough memory
in a box so you will have to consider IO as part of the computation.
I believe you're thinking of local IO - like a desktop. I'm thinking of
a cluster with a centralized file system and, perhaps, some local IO
for codes that need it.
>> I'm also working on a tool that can take the strace output and
>> create a "simulator" that will run in a similar manner to the
>> original code but actually perform the IO of the original code using
>> dummy data. This allows you to "give" away a simple dummy code to
>> various HPC storage vendors and test your application. This code is
>> taking a little longer than I'd hoped to develop :(
> It sounds cool, but I suspect that with even simpler tools you can
> probably deduce most of what is going on and get around it.
If you know a better way - let's hear it! I haven't seen one yet and having
worked for an HPC storage company I haven't seen one from them either.
I'm always looking for better techniques but I have to tell you, I'm really
skeptical of your ideas.
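To make the simulator idea concrete, here's a toy sketch of the replay concept. The trace format (a list of op/size records) is made up for illustration; a real replayer would also have to reproduce offsets, timing, and file layout, which is where the hard work is:

```python
import os
import tempfile

def replay(trace, path):
    """Replay a list of ("read"|"write", nbytes) records against a scratch
    file using dummy data, so the IO pattern of an application can be
    reproduced without shipping the real code or data set."""
    total = {"read": 0, "write": 0}
    with open(path, "wb+") as f:
        for op, nbytes in trace:
            if op == "write":
                total["write"] += f.write(b"\0" * nbytes)
            elif op == "read":
                f.seek(0)  # toy policy: every read starts at offset 0
                total["read"] += len(f.read(nbytes))
    return total

# Fabricated trace: two 4 KB writes, then a 1 KB read.
trace = [("write", 4096), ("write", 4096), ("read", 1024)]
with tempfile.TemporaryDirectory() as d:
    totals = replay(trace, os.path.join(d, "scratch.dat"))
```

The point is that a vendor can run something like this against their storage and see how it behaves under your IO pattern, without ever seeing your source or your data.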