[Beowulf] single machine with 500 GB of RAM
Ellis H. Wilson III
ellis at cse.psu.edu
Wed Jan 9 16:47:36 PST 2013
On 01/09/2013 05:53 PM, Mark Hahn wrote:
> given the "slicing" methodology that this application uses for
> decomposition, I wonder whether it's actually closer to sequential
> in its access patterns, rather than random. the point here is that
> you absolutely must have ram if your access is random, since your
> only constraint is latency. if a lot of your accesses are sequential,
> then they are potentially much more IO-like - specifically disk-like.
This is a great point, but I'd caution against (not suggesting you are)
jumping straight to the conclusion that because a data space can be
sliced it must be sequential in nature. I worked a bit with Orca (a
Gaussian-type chem app) in my undergrad and found that some of the
sub-applications in the project were near-EP and sliced up cleanly to be
merged later, but were still random in nature during the bulk of the
execution. They just split the data space into "cubes" of sorts and
distribute those bounds (and sometimes the associated data) to the slave
procs. Within each process the accesses to its "cube" of data were very
close to completely random.
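To make that concrete, here's a toy Python sketch (nothing to do with
Orca's actual code, and all the sizes are made up) of the shape of that
pattern: the global grid is carved into cubes that get handed out to
workers, yet every touch inside a cube is random, so each process still
looks latency-bound to whatever sits underneath it.

import random

N = 512          # hypothetical global grid edge length
CUBE = 128       # hypothetical cube edge length handed to each worker

def cube_bounds(rank, cubes_per_dim):
    """Map a worker rank to the (x, y, z) bounds of its cube."""
    x = rank % cubes_per_dim
    y = (rank // cubes_per_dim) % cubes_per_dim
    z = rank // (cubes_per_dim ** 2)
    return tuple((d * CUBE, (d + 1) * CUBE) for d in (x, y, z))

def worker_accesses(rank, n_accesses=10):
    """Random touches confined to this worker's cube -- tidy slicing
    from a bird's-eye view, random access within each slice."""
    (x0, x1), (y0, y1), (z0, z1) = cube_bounds(rank, N // CUBE)
    return [(random.randrange(x0, x1),
             random.randrange(y0, y1),
             random.randrange(z0, z1)) for _ in range(n_accesses)]

print(worker_accesses(rank=5))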
This is a great example of why one should trace the application with a
wide array of real-life parameters before committing to the purchase.
Even something as trivial as running dstat while the application chugs
along would give you a good hint at where the bottlenecks are and where
you want to spend your cash.
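For instance, something along these lines (a rough Python/psutil
stand-in for dstat -- psutil assumed installed, dstat itself is the
one-liner version) run next to the application will tell you pretty
quickly whether you're burning your time in swap or on the disks:

import time
import psutil

def trace(interval=5, samples=120):
    prev_io = psutil.disk_io_counters()
    prev_swap = psutil.swap_memory()
    for _ in range(samples):
        time.sleep(interval)
        io, swap = psutil.disk_io_counters(), psutil.swap_memory()
        print("cpu %5.1f%%  read %8.1f MB/s  write %8.1f MB/s  "
              "swap-in %6.1f MB/s  swap-out %6.1f MB/s" % (
            psutil.cpu_percent(),
            (io.read_bytes - prev_io.read_bytes) / interval / 2**20,
            (io.write_bytes - prev_io.write_bytes) / interval / 2**20,
            (swap.sin - prev_swap.sin) / interval / 2**20,
            (swap.sout - prev_swap.sout) / interval / 2**20))
        prev_io, prev_swap = io, swap

if __name__ == "__main__":
    trace()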
>> have proper memory, it isn't optimized, and as a result you're
>> constantly swapping. Merges are a good example of what /should/ work
>
> if the domain is sliced in the "right" direction, merging should be
> very efficient. even if sliced in the wrong direction, merging should
> at least be block-able (and thus not terrible.)
Ah, excellent point!
>> merging on just one of them that is also outfitted with a ramdisk'd 0.5
>> TB Fusion-IO PCI-E flash device. If I am not wildly off the mark on the
>
> I wouldn't bother with PCI-E flash, myself. they tend to have
> dumb/traditional raid controllers on them. doing raid0 across a
> handful of cheap 2.5" SATA SSDs is ridiculously easy to do and will
> scale up fairly well (with some attention to the PCIE topology
> connecting the controllers, of course.)
My attention to them has revolved more around the latency issues that
arise when you add a HW RAID controller and the SATA protocol between
the CPU and the SSD. This is exacerbated if the access patterns do not
remain completely random but converge over the lifetime of the run -- in
that case the accesses may localize onto a single SSD in the RAID array
and bottleneck performance towards the tail end of the execution. This
could happen to a certain extent on a PCI-E device as well, but it's
somewhat less likely, since those fancy devices tend to migrate data if
a channel/package/die becomes massively overworked. Maybe there are RAID
controllers that do this as well; I'm unsure.
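A toy illustration of that localization worry (stripe unit, disk count,
and capacity all made up): with a 4-disk RAID0, a workload that
converges onto a narrow hot region lands almost entirely on one SSD.

import random
from collections import Counter

N_DISKS = 4
STRIPE = 128 * 1024          # assumed stripe unit in bytes
DEVICE_SPAN = 100 * 2**30    # 100 GiB of addressable space

def disk_for(offset):
    """Which member of the RAID0 array serves this byte offset."""
    return (offset // STRIPE) % N_DISKS

uniform = Counter(disk_for(random.randrange(DEVICE_SPAN))
                  for _ in range(100000))

hot_base = random.randrange(DEVICE_SPAN // STRIPE) * STRIPE  # one hot stripe
localized = Counter(disk_for(hot_base + random.randrange(STRIPE))
                    for _ in range(100000))

print("uniform access spread:  ", dict(uniform))
print("localized access spread:", dict(localized))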
I saw the bit on Tom's Hardware about scaling SSDs for throughput and
IOPS, but is there a scaling study anywhere that measures individual
request latencies and CDFs them? That would be really interesting to me,
because although IOPS is a nice number to throw around, if your
application is waiting on a particular request to come back it doesn't
really matter how many other IOs are completing in the interim -- you
need that specific request to complete, and to complete quickly, to keep
chugging along.
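Something like the following back-of-envelope Python is roughly what I
have in mind for a single device (the path and sizes are placeholders,
and for a real study you'd want O_DIRECT or fio's latency percentiles so
the page cache doesn't flatter the numbers):

import os
import random
import time

PATH = "/mnt/ssd/testfile"      # placeholder test file on the device
IO_SIZE = 4096
N_REQUESTS = 10000

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
lat = []
for _ in range(N_REQUESTS):
    off = random.randrange(0, size - IO_SIZE, IO_SIZE)
    t0 = time.perf_counter()
    os.pread(fd, IO_SIZE, off)          # time each individual read
    lat.append((time.perf_counter() - t0) * 1e6)   # microseconds
os.close(fd)

lat.sort()
for pct in (50, 90, 99, 99.9):
    idx = min(len(lat) - 1, int(len(lat) * pct / 100))
    print("p%-5s %8.1f us" % (pct, lat[idx]))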
Best,
ellis