<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

<br class="">

<div><br class="">

<blockquote type="cite" class="">

<div class="">On 4 Feb 2021, at 10:40, Jonathan Aquilina &lt;<a href="mailto:jaquilina@eagleeyet.net" class="">jaquilina@eagleeyet.net</a>&gt; wrote:</div>

<br class="Apple-interchange-newline">

<div class=""><span style="caret-color: rgb(0, 0, 0); font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration: none; float: none; display: inline !important;" class="">Maybe

 SETI@home wasnt the right project to mention, just remembered there is another project but not in genomics on that distributed platform called Folding@home.</span></div>

</blockquote>

<div><br class="">

</div>

<div>Right, protein dynamics simulations like that sit at the other end of the data/compute ratio spectrum: very little data and a lot of compute per work unit.  That makes them very suitable for that sort of distributed computing.</div>

<br class="">

<blockquote type="cite" class="">

<div class=""><span style="caret-color: rgb(0, 0, 0); font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 16px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration: none; float: none; display: inline !important;" class="">So

 with genomics you cannot break it down into smaller chunks where the data can be crunched then returned to sender and then processed once the data is back or as its being received?</span></div>

</blockquote>

</div>

<br class="">

<div class="">It depends on what you’re doing.  If you already know the reference genome then yes, you can.  We already do this to some extent: the reads from the sequencing run are de-multiplexed first, and then the reads for each sample are processed as a separate, embarrassingly parallel job.  This is basically doing a jigsaw puzzle when you know the picture.</div>
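<div class=""><br class=""></div>
<div class="">To make the de-multiplex-then-fan-out shape concrete, here is a minimal Python sketch.  It is an illustration only, not our production pipeline: the barcode length, the <code>SAMPLE_SHEET</code> mapping and the <code>process_sample</code> stand-in are all made up, and a thread pool stands in for what would really be separate cluster jobs.</div>
<pre class="">
```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

BARCODE_LEN = 4  # assumed barcode length, for illustration only

# Hypothetical barcode-to-sample sheet.
SAMPLE_SHEET = {"ACGT": "sample_A", "TTAG": "sample_B"}


def demultiplex(reads):
    """Group reads by sample using the barcode prefix, stripping the barcode."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        sample = SAMPLE_SHEET.get(barcode, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


def process_sample(item):
    """Stand-in for the real per-sample pipeline: count reads and bases."""
    sample, reads = item
    return sample, {"reads": len(reads), "bases": sum(len(r) for r in reads)}


def run(reads):
    """De-multiplex once, then process every sample independently."""
    by_sample = demultiplex(reads)
    # Samples share no state, so the jobs can run in any order, anywhere.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(process_sample, by_sample.items()))
```
</pre>
<div class="">For example, <code>run(["ACGTGGGG", "TTAGCC", "ACGTAA"])</code> groups two reads under <code>sample_A</code> and one under <code>sample_B</code>, and neither job needs to know the other exists.</div>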

<div class=""><br class="">

</div>

<div class="">Read alignment against a reference (if you already have a standard reference genome) is easily decomposable as far as you like, right down to a single read in the extreme case.  But the compute for a single read is tiny (this is basically fuzzy grep), and you’d be swamped in scheduling overhead.  For maximum throughput we don’t bother distributing it further; we use multithreading on a single node.</div>
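<div class=""><br class=""></div>
<div class="">As a rough picture of the "fuzzy grep" idea, the sketch below slides each read along a reference string and accepts the placement with the fewest mismatches.  This is a toy, assuming substitutions only and a brute-force scan; real aligners use indexed references and handle insertions and deletions.  The function names and the <code>max_mismatches</code> parameter are mine, not taken from any tool.  The thread pool only shows the structure of the work (Python threads will not speed up this CPU-bound loop).</div>
<pre class="">
```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


def align_all(reads, reference, workers=4):
    """Align every read independently; no read depends on any other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: align_read(r, reference), reads))
```
</pre>
<div class="">Each call to <code>align_read</code> touches only its own read and the shared, read-only reference, which is why the work splits so cleanly and why the per-read cost is too small to be worth scheduling on its own.</div>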

<div class=""><br class="">

</div>

<div class="">There have been some interesting distributed mapping attempts, for example decomposing the problem into read groups small enough to fit within the time limit of an AWS Lambda function (reference to the Lambda work:  <a href="https://www.biorxiv.org/content/10.1101/576199v1.full.pdf" class="">https://www.biorxiv.org/content/10.1101/576199v1.full.pdf</a>).  You get fabulous turnaround time on the analysis if you do that, but you use about four times as much actual compute time as the single-node, multithreaded approach we currently use.  As usual, it all depends on what you’re optimising for: cost, throughput, or turnaround time.</div>
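<div class=""><br class=""></div>
<div class="">A back-of-envelope version of that trade-off, as a Python sketch.  Only the factor of roughly four comes from the comparison above; the core counts, worker counts and the assumption of perfect scaling are invented purely for illustration.</div>
<pre class="">
```python
def compare(single_node_core_hours, cores_per_node, lambda_workers,
            overhead_factor=4.0):
    """Compare wall-clock time and total compute for the two approaches.

    Assumes (unrealistically) perfect scaling in both cases, and that the
    distributed run costs overhead_factor times the single-node compute.
    """
    single_wall = single_node_core_hours / cores_per_node
    lambda_core_hours = single_node_core_hours * overhead_factor
    lambda_wall = lambda_core_hours / lambda_workers
    return {
        "single_node": {"core_hours": single_node_core_hours,
                        "wall_hours": single_wall},
        "lambda": {"core_hours": lambda_core_hours,
                   "wall_hours": lambda_wall},
    }
```
</pre>
<div class="">With made-up inputs of 64 core-hours on a 32-core node versus 1024 concurrent workers, the single node takes 2 hours of wall-clock time for 64 core-hours, while the distributed run finishes in 0.25 hours but burns 256 core-hours.</div>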

<div class=""><br class="">

</div>

<div class="">For some of our projects (Darwin Tree of Life being the prime example), you don’t know what the reference genome looks like.  The problem is still fuzzy grep, but now you’re comparing the reads against each other and looking for overlaps, rather than comparing them all independently against the reference.  You’re doing the jigsaw puzzle without knowing the picture.  That’s a bit harder to distribute, and most approaches currently cop out and do it all on a single large-memory machine.  One way to make this easier is to make the reads longer (i.e. fewer, larger puzzle pieces), which is what sequencing technologies like Oxford Nanopore and PacBio Sequel try to do.  But their throughput is not as high as the short-read Illumina approach.</div>
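<div class=""><br class=""></div>
<div class="">To show why the no-reference case is harder to split up, here is a toy greedy overlap assembler in Python.  It is a textbook-style sketch under strong assumptions (error-free reads, exact suffix/prefix overlaps, no repeats), not how any production assembler works, and the names and the <code>min_len</code> threshold are mine.  The point is the all-against-all comparison in the inner loops: every read may need to see every other read.</div>
<pre class="">
```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        # All-against-all comparison: quadratic in the number of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps; return separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads
```
</pre>
<div class="">For instance, <code>greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATCC"])</code> returns <code>["ATGGCGTGCAATCC"]</code>.  Longer reads mean fewer pieces and longer, less ambiguous overlaps, which is the "larger puzzle pieces" effect.</div>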

<div class=""><br class="">

</div>

<div class="">Some people have taken distributed approaches though (JGI’s MetaHipMer for example:  <a href="https://www.nature.com/articles/s41598-020-67416-5" class="">https://www.nature.com/articles/s41598-020-67416-5</a>).  That’s tackling an even nastier problem: sequencing many genomes at the same time, for example gut flora from a stool sample, and not only doing <i class="">de novo</i> assembly as in the last example, but doing so when you don’t know how many different genomes you have in the sample.  So now you have multiple jigsaw puzzles mixed up in the same box, and you don’t know any of the pictures.  And of course you have multiple strains, so some of those puzzles have the same picture but 1% of the pieces are different, and you need to work out which is which.</div>

<div class=""><br class="">

</div>

<div class="">Fun fun fun!</div>

<div class=""><br class="">

</div>

<div class="">Tim</div>

<div class=""><br class="">

</div>

<div class=""><br class="">

</div>


</body>

</html>