[Beowulf] Project Heron at the Sanger Institute [EXT]

Thu Feb 4 11:49:06 UTC 2021

On 4 Feb 2021, at 10:40, Jonathan Aquilina <jaquilina at eagleeyet.net> wrote:

Maybe SETI@home wasn't the right project to mention; I just remembered there is another project on that distributed platform, though not in genomics, called Folding@home.

Right, protein dynamics simulations like that are at the other end of the data/compute ratio spectrum.  Very suitable for distributed computing in that sort of way.

So with genomics you cannot break it down into smaller chunks, where the data can be crunched, returned to sender, and then processed once the data is back or as it's being received?

It depends on what you’re doing.  If you already know the reference genome then yes, you can.  We already do this to some extent; the reads from the sequencing run are de-multiplexed first, and then the reads for each sample are processed as a separate embarrassingly parallel job.  This is basically doing a jigsaw puzzle when you know the picture.
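As a rough illustration of the de-multiplex-then-process-per-sample pattern described above, here is a minimal Python sketch. It is not our pipeline: the inline 4-base barcode at the start of each read, the `barcode_to_sample` mapping, and the trivial `process_sample` step are all hypothetical stand-ins chosen to keep the example short.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

BARCODE_LEN = 4  # hypothetical: assume a 4-base barcode prefixed to each read


def demultiplex(reads, barcode_to_sample):
    """Group reads by the sample that their leading barcode belongs to."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        sample = barcode_to_sample.get(barcode)
        if sample is not None:  # reads with an unknown barcode are dropped
            by_sample[sample].append(insert)
    return dict(by_sample)


def process_sample(item):
    """Stand-in for the real per-sample analysis: count reads and bases."""
    sample, reads = item
    return sample, len(reads), sum(len(r) for r in reads)


def run(reads, barcode_to_sample, workers=4):
    """De-multiplex once, then handle every sample as an independent job."""
    groups = demultiplex(reads, barcode_to_sample)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(process_sample, groups.items()))
```

The point is only that no per-sample job needs anything from any other, so each one could equally be a separate batch job on a cluster.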

The read alignment to reference (if you already have a standard reference genome) is easily decomposable as much as you like, right down to a single read in the extreme case, but the compute for a single read is tiny (this is basically fuzzy grep going on here), and you’d be swamped in scheduling overhead.  For maximum throughput we don’t bother distributing it further, but use multithreading on a single node.
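To make the "fuzzy grep" description concrete, here is a toy Python sketch that places each read on a reference by brute-force Hamming distance. It is an illustration only: real aligners use indexed data structures and handle insertions and deletions, and the function names and the `max_mismatches` parameter are made up for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read, reference, max_mismatches=2):
    """Return (position, mismatches) of the best placement, or None.

    Every read is handled independently of every other read, which is
    why the work splits as finely as you like.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best


def align_all(reads, reference, max_mismatches=2):
    """Align a batch of reads; each call to best_hit is a tiny unit of work."""
    return [best_hit(r, reference, max_mismatches) for r in reads]
```

Because each `best_hit` call is so small, handing reads out one at a time to a scheduler would cost far more than the work itself, hence batching many reads per task.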

There have been some interesting distributed mapping attempts, for example decomposing the problem into read groups small enough to fit in the time limit of an AWS Lambda function.  You get fabulous turnaround time on the analysis if you do that, but you use about four times as much actual compute time as the single node, multi-thread approach we currently use. (reference to the Lambda work:  https://www.biorxiv.org/content/10.1101/576199v1.full.pdf). As usual, it all depends on what you’re optimising for: cost, throughput, or turnaround time.
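The trade-off between turnaround time and total compute can be sketched with a toy cost model. This is a back-of-envelope illustration under invented assumptions (a fixed per-task overhead, perfectly even splitting, hypothetical numbers); it is not derived from the Lambda paper and does not reproduce its fourfold figure.

```python
def cost_model(total_work_s, n_tasks, per_task_overhead_s, n_workers):
    """Return (turnaround_s, total_compute_s) for an evenly split job.

    Assumes every task pays the same fixed overhead (startup, data
    staging) and that tasks run in waves of n_workers at a time.
    """
    per_task = total_work_s / n_tasks + per_task_overhead_s
    total_compute = per_task * n_tasks
    waves = -(-n_tasks // n_workers)  # ceiling division
    turnaround = waves * per_task
    return turnaround, total_compute


# One big task on one worker: slow turnaround, almost no wasted compute.
single = cost_model(3600, 1, 5, 1)
# 3600 tiny tasks on 3600 workers: fast turnaround, overhead dominates.
fine = cost_model(3600, 3600, 5, 3600)
```

With these made-up numbers the finely split run finishes in seconds but burns several times the compute of the single task, which is the shape of the trade-off described above.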

For some of our projects (Darwin Tree of Life being the prime example), you don’t know what the reference genome looks like.  The problem is still fuzzy grep, but now you’re comparing the reads against each other and looking for overlaps, rather than comparing them all independently against the reference.  You’re doing the jigsaw puzzle without knowing the picture.  That’s a bit harder to distribute, and most approaches currently cop out and do it all in single large-memory machines.  One way to make this easier is to make the reads longer (i.e. make the puzzle pieces larger and fewer of them), which is what sequencing technologies like Oxford Nanopore and PacBio Sequel try to do.  But their throughput is not as high as the short-read Illumina approach.
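For contrast with alignment against a reference, here is a toy Python sketch of the read-against-read comparison described above. It is heavily simplified: it looks only for exact suffix-to-prefix overlaps, whereas real assemblers tolerate sequencing errors and avoid comparing all pairs naively. The function names and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads and keep the non-zero overlaps.

    Unlike alignment to a reference, each read may need to be compared
    with any other read, so the reads cannot be treated independently.
    """
    found = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k > 0:
                    found[(i, j)] = k
    return found
```

For the reads `["ACGTAC", "TACGGA", "GGATTT"]` this finds that read 0 runs into read 1 and read 1 runs into read 2, which is the chain an assembler would then stitch together.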

Some people have taken distributed approaches though (JGI’s MetaHipMer for example:  https://www.nature.com/articles/s41598-020-67416-5).  That’s tackling an even nastier problem: sequencing many genomes at the same time, for example gut flora from a stool sample, and not only doing de novo assembly as in the last example, but trying to do so when you don’t know how many different genomes you have in the sample.  So now you have multiple jigsaw puzzles mixed up in the same box, and you don’t know any of the pictures.  And of course you have multiple strains, so some of those puzzles have the same picture but 1% of the pieces are different, and you need to work out which is which.

Fun fun fun!

Tim

-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.