[Beowulf] Generation of strings MPI fashion..

Sun Oct 9 21:56:21 PDT 2016

Sent from my iPhone

> On 9 Oct 2016, at 11:05 pm, John Hearns <John.Hearns at xma.co.uk> wrote:
> 
> This sounds interesting.
> 
> I would question this step though, as it seems intrinically a bottlneck:
>> Removes duplicate lines:
>> $ sort filename.txt | uniq
> 
> First off - do you need a sorted list?  If not do not perform the sort...
> 
> Secondly - how efficient is the Unix 'uniq' utility?

The command could be collapsed to

sort -u

Gnu sort is quite efficient I think.

> How about, every time you generate a string, check if it has already been generated.  I guess this then starts to depend on the efficiency of searching through a file.

Assuming you're using a binary tree to store those, you're effectively sorting the list anyway. Probably with O(N log N) execution time. 

> Thinking about this problem, would it be better to allocate each MPI rank a unique set of string to generate?
> Say there are N characters in the set A-Z a-z 0-9 plus symbols . This is likely to be 255 as I'm assuming you are talking the ASCII character set.
> So then if you have 256  MPI ranks then allocate rank 1 to generate all strings commencing with 'A'  then rank 2 'B' etc.
> If you are lucky enough to have 256x256 cores to allocate then it goes 'AA' 'AB' 'AC'  etc. etc.

That's no longer random though is it? You're absolutely fixing the frequency of the first letter. The more you do that, the less random your strings will become. 

Tim

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.