<div dir="ltr">Referring to lambda functions, I think I flagged up that AWS now supports containers up to 10GB in size for the lambda payload<div><a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/">https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/</a><br></div><div><br></div><div>which makes a Julia language lambda possible <a href="https://www.youtube.com/watch?v=6DvpneWRb_w">https://www.youtube.com/watch?v=6DvpneWRb_w</a></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 4 Feb 2021 at 11:49, Tim Cutts <<a href="mailto:tjrc@sanger.ac.uk">tjrc@sanger.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div style="overflow-wrap: break-word;">

<br>

<div><br>

<blockquote type="cite">

<div>On 4 Feb 2021, at 10:40, Jonathan Aquilina <<a href="mailto:jaquilina@eagleeyet.net" target="_blank">jaquilina@eagleeyet.net</a>> wrote:</div>

<br>

<div><span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:16px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration:none;float:none;display:inline">Maybe

 SETI@home wasn't the right project to mention; I just remembered there is another project on that distributed platform, though not in genomics, called Folding@home.</span></div>

</blockquote>

<div><br>

</div>

<div>Right, protein dynamics simulations like that are at the other end of the data/compute ratio spectrum.  Very suitable for distributed computing in that sort of way.</div>

<br>

<blockquote type="cite">

<div><span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:16px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration:none;float:none;display:inline">So

 with genomics you cannot break it down into smaller chunks, where the data can be crunched, returned to the sender, and then processed once it is back or as it's being received?</span></div>

</blockquote>

</div>

<br>

<div>It depends on what you’re doing. If you already know the reference genome then yes, you can. We already do this to some extent; the reads from the sequencing run are de-multiplexed first, and then the reads for each sample are processed as a

 separate embarrassingly parallel job. This is basically doing a jigsaw puzzle when you know the picture.</div>
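<div>As a rough illustration of the de-multiplex-then-fan-out pattern, here is a minimal Python sketch. It assumes, purely for illustration, that each read starts with a fixed-length sample barcode; the names <code>BARCODE_LEN</code>, <code>demultiplex</code>, <code>process_sample</code> and <code>run</code> are hypothetical, and the per-sample "pipeline" is just a counter standing in for the real work.</div>

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

BARCODE_LEN = 4  # hypothetical barcode length, for illustration only


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using the barcode prefix of each read."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        sample = barcode_to_sample.get(barcode)
        if sample is not None:  # reads with unknown barcodes are dropped
            by_sample[sample].append(insert)
    return dict(by_sample)


def process_sample(item):
    """Stand-in for the real per-sample pipeline: count reads and bases."""
    sample, reads = item
    return sample, {"reads": len(reads), "bases": sum(len(r) for r in reads)}


def run(reads, barcode_to_sample, workers=4):
    by_sample = demultiplex(reads, barcode_to_sample)
    # Samples are independent of each other, so each one can be handed
    # to a separate worker with no communication between them.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_sample, by_sample.items()))
```

<div>In a real setting each sample would more likely become its own batch job than a thread in one process; the point is only that nothing is shared between samples once the reads are split.</div>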

<div><br>

</div>

<div>The read alignment to reference (if you already have a standard reference genome) is easily decomposable, as finely as you like, right down to a single read in the extreme case. But the compute for a single read is tiny (this is basically fuzzy grep

 going on here), and you’d be swamped in scheduling overhead. For maximum throughput we don’t bother distributing it further, but use multithreading on a single node.</div>
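<div>To make the "fuzzy grep" idea concrete, here is a toy Python sketch that slides each read along the reference and accepts the best placement within a mismatch budget. It is an illustration under simplifying assumptions (substitutions only, no indels, a plain linear scan), not how production aligners work; those search an index of the reference instead. The function names are hypothetical.</div>

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_grep(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        # Keep the first placement with the fewest mismatches.
        if max_mismatches >= d and (best is None or best[1] > d):
            best = (pos, d)
    return best


def align_all(reads, reference, max_mismatches=1):
    # Each read is aligned independently of every other read, so this
    # loop could be split across threads or nodes at any granularity.
    return [fuzzy_grep(r, reference, max_mismatches) for r in reads]
```

<div>Because no read depends on any other, the list comprehension in <code>align_all</code> can be cut anywhere; the work per read is so small, though, that the cost of dispatching it quickly dominates.</div>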

<div><br>

</div>

<div>There have been some interesting distributed mapping attempts, for example decomposing the problem into read groups small enough to fit in the time limit of an AWS Lambda function. You get fabulous turnaround time on the analysis if you do that,

 but you use about four times as much actual compute time as the single-node, multi-threaded approach we currently use (reference to the Lambda work: <a href="https://www.biorxiv.org/content/10.1101/576199v1.full.pdf" target="_blank">https://www.biorxiv.org/content/10.1101/576199v1.full.pdf</a>).

 As usual, it all depends on what you’re optimising for: cost, throughput, or turnaround time.</div>

<div><br>

</div>

<div>For some of our projects (Darwin Tree of Life being the prime example), you don’t know what the reference genome looks like. The problem is still fuzzy grep, but now you’re comparing the reads against each other and looking for overlaps, rather

 than comparing them all independently against the reference. You’re doing the jigsaw puzzle without knowing the picture. That’s a bit harder to distribute, and most approaches currently cop out and do it all on a single large-memory machine. One way to make

 this easier is to make the reads longer (i.e. make the puzzle pieces larger, with fewer of them), which is what sequencing technologies like Oxford Nanopore and PacBio Sequel try to do. But their throughput is not as high as the short-read Illumina approach.</div>
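<div>For a feel of why the no-reference case is harder to distribute, here is a toy greedy overlap assembler in Python. It is only a sketch under strong assumptions (error-free reads, exact suffix/prefix overlaps, no reverse complements), and the names <code>overlap</code> and <code>greedy_assemble</code> are hypothetical; real assemblers build overlap or de Bruijn graphs rather than looping like this.</div>

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best = (0, None, None)
        # All-against-all comparison: every pass looks at every pair,
        # which is why the whole read set wants to sit in one memory.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)]
        contigs.append(merged)
```

<div>Unlike alignment against a reference, every read here may need to be compared with every other read, so the work does not split into independent pieces. Longer reads mean fewer pieces and longer, less ambiguous overlaps.</div>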

<div><br>

</div>

<div>Some people have taken distributed approaches though (JGI’s MetaHipMer for example: <a href="https://www.nature.com/articles/s41598-020-67416-5" target="_blank">https://www.nature.com/articles/s41598-020-67416-5</a>). That’s tackling an even nastier

 problem: sequencing many genomes at the same time, for example gut flora from a stool sample, and not only doing

<i>de novo</i> assembly as in the last example, but trying to do so when you don’t know how many different genomes you have in the sample. So now you have multiple jigsaw puzzles mixed up in the same box, and you don’t know any of the pictures. And

 of course you have multiple strains, so some of those puzzles have the same picture but 1% of the pieces are different, and you need to work out which is which.</div>

<div><br>

</div>

<div>Fun fun fun!</div>

<div><br>

</div>

<div>Tim</div>

<div><br>

</div>

<div><br>

</div>

-- 

 The Wellcome Sanger Institute is operated by Genome Research 

 Limited, a charity registered in England with number 1021457 and a 

 company registered in England with number 2742969, whose registered 

 office is 215 Euston Road, London, NW1 2BE. 

</div>

_______________________________________________<br>

Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>

To change your subscription (digest mode or unsubscribe) visit <a href="https://beowulf.org/cgi-bin/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">https://beowulf.org/cgi-bin/mailman/listinfo/beowulf</a><br>

</blockquote></div>