[Beowulf] Project Heron at the Sanger Institute [EXT]

Thu Feb 4 12:06:59 UTC 2021

Referring to Lambda functions, I think I flagged up that AWS Lambda now
supports container images up to 10 GB in size as the function payload
https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/

which makes a Julia-language Lambda possible
https://www.youtube.com/watch?v=6DvpneWRb_w

On Thu, 4 Feb 2021 at 11:49, Tim Cutts <tjrc at sanger.ac.uk> wrote:

>
>
> On 4 Feb 2021, at 10:40, Jonathan Aquilina <jaquilina at eagleeyet.net>
> wrote:
>
> Maybe SETI@home wasn't the right project to mention; I just remembered
> there is another project, though not in genomics, on that distributed
> platform, called Folding@home.
>
>
> Right, protein dynamics simulations like that are at the other end of the
> data/compute ratio spectrum.  Very suitable for distributed computing in
> that sort of way.
>
> So with genomics you cannot break it down into smaller chunks where the
> data can be crunched, returned to sender, and then processed once the
> data is back or as it's being received?
>
>
> It depends on what you’re doing.  If you already know the reference genome
> then, yes you can.  We already do this to some extent; the reads from the
> sequencing run are de-multiplexed first, and then the reads for each sample
> are processed as a separate embarrassingly parallel job.  This is basically
> doing a jigsaw puzzle when you know the picture.
>
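A minimal Python sketch of the "de-multiplex first, then one independent
job per sample" pattern described above. Everything named here is an
assumption for illustration only: the barcode table, the 4-base inline
barcode at the start of each read, and the stand-in per-sample job. Real
runs take barcodes from a sample sheet and do far more per sample.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical barcode-to-sample table (assumption, not a real sample sheet).
BARCODES = {"ACGT": "sample_a", "TTAG": "sample_b"}
BARCODE_LEN = 4


def demultiplex(reads):
    """Group reads by their leading barcode; drop reads with unknown barcodes."""
    by_sample = defaultdict(list)
    for read in reads:
        sample = BARCODES.get(read[:BARCODE_LEN])
        if sample is not None:
            by_sample[sample].append(read[BARCODE_LEN:])
    return dict(by_sample)


def process_sample(item):
    """Stand-in for the real per-sample analysis: just count the reads."""
    sample, reads = item
    return sample, len(reads)


def run(reads):
    """Demultiplex, then run one independent job per sample."""
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(process_sample, demultiplex(reads).items()))
```

No sample's job needs anything from another sample's job, which is what
makes this step embarrassingly parallel.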
> The read alignment to reference (if you already have a standard reference
> genome) is easily decomposable as much as you like, right down to a single
> read in the extreme case, but the compute for a single read is tiny (this
> is basically fuzzy grep going on here), and you’d be swamped in scheduling
> overhead.  For maximum throughput we don’t bother distributing it further,
> but use multithreading on a single node.
>
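To make the "fuzzy grep" picture concrete, here is a toy Python sketch,
not how a production aligner works. It assumes a brute-force scan with a
Hamming-distance cutoff; the function names and the batch_size and
max_mismatches parameters are made up for illustration. Setting
batch_size=1 is the "single read per task" extreme mentioned above, where
scheduling overhead would dominate the tiny per-read work.

```python
from concurrent.futures import ThreadPoolExecutor


def best_hit(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def align_batch(batch, reference):
    """Align every read in one batch; each read is independent of the others."""
    return [best_hit(read, reference) for read in batch]


def align_all(reads, reference, batch_size=2, workers=4):
    """Split reads into batches and hand each batch to a worker thread.

    Threads only illustrate the structure here; CPython's GIL means this
    pure-Python loop will not actually speed up with more threads.
    """
    batches = [reads[i:i + batch_size] for i in range(0, len(reads), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: align_batch(b, reference), batches)
    return [hit for batch in results for hit in batch]
```

The result for each read depends only on that read and the reference, so
the batch size is purely a tuning knob between scheduling overhead and
turnaround time.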
> There have been some interesting distributed mapping attempts, for example
> decomposing the problem into read groups small enough to fit in the time
> limit of an AWS lambda function.  You get fabulous turnaround time on the
> analysis if you do that, but you use about four times as much actual
> compute time as the single node, multi-thread approach we currently use.
> (reference to the lambda work:
> https://www.biorxiv.org/content/10.1101/576199v1.full.pdf). As usual, it
> all depends on what you’re optimising for: cost, throughput, or turnaround
> time.
>
> For some of our projects (Darwin Tree of Life being the prime example),
> you don’t know what the reference genome looks like.  The problem is still
> fuzzy grep, but now you’re comparing the reads against each other and
> looking for overlaps, rather than comparing them all independently against
> the reference.  You’re doing the jigsaw puzzle without knowing the
> picture.  That’s a bit harder to distribute, and most approaches currently
> cop out and do it all on single large-memory machines.  One way to make
> this easier is to make the reads longer (i.e. make the puzzle pieces larger
> and fewer of them) which is what sequencing technologies like Oxford
> Nanopore and PacBio Sequel try to do.  But their throughput is not as high
> as the short read Illumina approach.
>
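A toy Python sketch of the overlap step in the "jigsaw without the
picture" case, to show why it is harder to distribute. It assumes exact
suffix-to-prefix matches and an arbitrary minimum overlap length; real
assemblers tolerate sequencing errors and use indexes rather than
comparing every pair. The function names are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads and keep the overlapping ones."""
    found = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            found[(a, b)] = n
    return found
```

Unlike alignment to a reference, every read is a potential partner for
every other read, so the work does not split cleanly into independent
per-read tasks.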
> Some people have taken distributed approaches though (JGI’s MetaHipMer for
> example:  https://www.nature.com/articles/s41598-020-67416-5).  That’s
> tackling an even nastier problem: sequencing many genomes at the same
> time, for example gut flora from a stool sample, and not only
> doing *de novo* assembly as in the last example, but trying to do so when
> you don’t know how many different genomes you have in the sample.  So now
> you have multiple jigsaw puzzles mixed up in the same box, and you don’t
> know any of the pictures.  And of course you have multiple strains, so some
> of those puzzles have the same picture but 1% of the pieces are different,
> and you need to work out which is which.
>
> Fun fun fun!
>
> Tim