[Beowulf] Accelerator for data compressing
kyron at neuralbs.com
Tue Oct 7 11:07:14 PDT 2008
Dmitri Chubarov wrote:
> we have got a VX50 down here as well. We have observed very different
> scalability on different applications. With an OpenMP molecular
> dynamics code we have got over 14 times speedup while on a 2D finite
> difference scheme I could not get far beyond 3 fold.
2D finite difference can be communication-intensive if the mesh is too
small for each processor to have a fair amount of work to do before
needing the neighboring values from a "far" node.
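
To put rough numbers on that, here is a small sketch. It assumes a square
n x n block of cells per processor and a one-cell-wide halo on each of the
four sides (as with a five-point stencil). Both are assumptions for
illustration, not details of Dmitri's code, and the function name is made up.
Work per step grows as n*n while the data pulled from neighbors grows as 4*n,
so the ratio is 4/n and gets worse as the blocks shrink.

```c
/* Ratio of halo cells fetched from neighbors to cells updated locally,
 * for an n x n block with a one-cell halo on each of its four sides.
 * Hypothetical helper, for illustration only. */
double halo_to_work_ratio(int n)
{
    double work = (double)n * (double)n; /* cells updated per time step */
    double halo = 4.0 * (double)n;       /* neighbor cells needed per time step */
    return halo / work;                  /* simplifies to 4/n */
}
```

With n = 1024 the ratio is about 0.004, so remote traffic is small next to
the local work. With n = 16 it is 0.25, and at n = 4 there is one remote
value fetched for every cell updated.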
> On Tue, Oct 7, 2008 at 10:45 PM, Eric Thibodeau <kyron at neuralbs.com> wrote:
>> PS: Interesting figures, I couldn't resist compressing the same binary DB on
>> a 16Core Opteron (Tyan VX50) machine and was dumbfounded to get horrible
>> results given the same context. The processing speed only came up to 6.4E6
>> bytes/sec ...for 16 cores, and they were all at 100% during the entire run
>> (FWIW, I tried different block sizes and it does have an impact but this
>> also changes the problem parameters).
> Reading your message in the Beowulf list I should say that it looks
> interesting and probably shows something happening with the memory
> access on the NUMA nodes. Did you try to run the archiver with
> different affinity settings?
I don't have affinity control over the app per se. I would have to
look at and modify pbzip's code. Note, though, that the assignment of a
PID to a processor is governed by the kernel and is thus a scheduler
issue. I have also noticed that the kernel doesn't move processes around
the cores just for the fun of it.
> We have observed that the memory architecture shows some strange
> behaviour. For instance the latency for a read from the same NUMA node
> for different nodes varies significantly.
This is the nature of NUMA. Furthermore, if you have to cross to a far
CPU, the latency is also dependent on the CPU's load.
> Also on the profiler I often see that x86 instructions that have one
> of the operands in memory may
> take disproportionally long. I believe that could explain the 100% CPU
> load reported by the kernel.
How do you identify the specific instruction using a profiler? This is
something that interests me.
> From the very little knowledge of this platform that we have got, I
> tend to advise the users not to expect good speedup on their
> multithreaded applications.
Using OpenMP (from GCC 4.3.x) and an embarrassingly parallel problem
(computing K-Means on a large database), I do get significant speedup.
> Yet it would be interesting to get a
> better understanding of the programming techniques for this sedecimus
> and the similar machines.
OpenMP is IMHO the easiest one that will bring you the most performance
out of 3 lines of #pragma directives. If you manage to get a cluster of
VX50s, then learn a bit of MPI to glue all of this together ;)
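
As a minimal sketch of how little OpenMP can take, here is a K-Means
assignment step in C with a single directive. This is not the code used for
the runs mentioned above. The function name, the row-major data layout and
the static schedule are all assumptions made for illustration. Each point is
handled independently, which is what makes the loop embarrassingly parallel.

```c
#include <float.h>
#include <stddef.h>

/* Assign each of n points to its nearest centroid (squared Euclidean
 * distance). points is n x dim and centroids is k x dim, both row-major.
 * Hypothetical example, not taken from any code in this thread. */
void assign_clusters(const double *points, size_t n, size_t dim,
                     const double *centroids, size_t k, int *labels)
{
    /* Iterations are independent: each one writes only labels[i]. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        double best = DBL_MAX;
        int best_c = 0;
        for (size_t c = 0; c < k; c++) {
            double d2 = 0.0;
            for (size_t j = 0; j < dim; j++) {
                double diff = points[(size_t)i * dim + j] - centroids[c * dim + j];
                d2 += diff * diff;
            }
            if (d2 < best) {
                best = d2;
                best_c = (int)c;
            }
        }
        labels[i] = best_c;
    }
}
```

Build with gcc -fopenmp to enable the directive. Without that flag the
pragma is ignored and the same code runs serially, which is handy for
checking results. The thread count can be set with the OMP_NUM_THREADS
environment variable.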
> Even more so due to the QPI systems becoming
> commercially available very soon.
Don't know that one (QPI)...oh...new Intel stuff...no matter how much I
try to stay ahead, I'm always years behind!
> At the moment we have got a few
> small kernels written in C and Fortran with OpenMP that we use to
> evaluate different parallelization strategies. Unfortunately, there
> are no tools I would know of that could help to explain what's going
> on inside the memory of this machine.
Of course, check out TAU (
http://www.cs.uoregon.edu/research/tau/home.php ), it will at least help
you identify bottlenecks and give you an impressive profiling
> I am very much interested to hear more about your experience with VX50.
> Best regards,
> Dima Chubarov
> Dmitri Chubarov
> junior researcher
> Siberian Branch of the Russian Academy of Sciences
> Institute of Computational Technologies