[Beowulf] Project Heron at the Sanger Institute [EXT]

Jörg Saßmannshausen sassy-work at sassy.formativ.net
Thu Feb 4 14:22:48 UTC 2021

Dear all,

chiming in here, as I am supporting some of the Covid-19 sequencing at my 
current workplace:

We have much smaller sequencing machines, and here you can do the analysis on 
the machine as well. The problem we are facing is simply storage capacity. 
It probably comes as no surprise that this project kicked off during 
lockdown, and I only got involved at the end of last year. We actually shuffle 
our data off to a QNAP and then do the analysis on a directly attached Linux 
machine. That way, the sequencing machine always has enough capacity to 
store more raw data. 
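For what it is worth, the offload step itself is nothing fancy. A minimal 
sketch of the pattern we follow (copy to the NAS, verify checksums, only then 
free the local space) looks roughly like this — all paths and function names 
here are made up for illustration, not our actual scripts:

```python
import hashlib
import shutil
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Checksum a file in chunks, so large raw-data files never sit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def offload_run(run_dir: Path, archive_dir: Path) -> None:
    """Copy a sequencing run to the archive, verify every file, then delete locally."""
    dest = archive_dir / run_dir.name
    shutil.copytree(run_dir, dest)
    for src_file in run_dir.rglob("*"):
        if src_file.is_file():
            copy = dest / src_file.relative_to(run_dir)
            if sha256sum(src_file) != sha256sum(copy):
                raise RuntimeError(f"checksum mismatch: {src_file}")
    shutil.rmtree(run_dir)  # only reclaim space once every file has verified
```

The important bit is the order: never delete from the sequencer side until the 
archive copy has been verified byte-for-byte.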

One of the things I have heard a few times is the use of GPUs for the analysis. 
Is that something you are doing as well? Also, on the topic of GPUs (and being 
a bit controversial): are there actually any programs out there which do not 
use nVidia GPUs and run on the AMD ones, for example?

> Scientists are conservative folks though, they sometimes get a bit nervous
> at the thought of discarding the raw sequence data.

Wearing my science hat: I can see why. Traditionally, things like sequencing 
(or NMR experiments) are expensive to do: you need the right sample and the 
right machine for it. I sometimes come back to old NMR spectra and take a fresh 
look at them, simply to see if there is something I missed the first time 
around because I did not know to look for it. With the Covid-19 variants, I 
guess it is the same. If you only keep the result ("yes, it contains Covid-19" 
or not), you cannot go back and re-analyse it for a different strain/variant. 
I would also argue that we might need to re-analyse the raw data to make sure 
a new pipeline produces the same results, for example. So for me, there are 
good reasons to keep at least the relevant raw data. 

On a related subject: do you attach metadata to your data so you can find 
relevant things more quickly? I am trying to encourage my users to think about 
that, so we can use tools like iRODS to get better data management. If I may 
ask: how does the Sanger do that, and would some kind of best practice, if it 
does not already exist, be a good idea here?
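In case it helps anyone thinking along the same lines: before we get as far as 
a full iRODS deployment, I ask users to at least drop a small key/value sidecar 
next to each run so it can be searched later. A toy sketch of the idea — the 
field names and helpers are purely my own convention, not any standard:

```python
import json
from pathlib import Path


def tag_run(run_dir: Path, **metadata) -> Path:
    """Write a JSON sidecar of key/value metadata next to a sequencing run."""
    sidecar = run_dir / "metadata.json"
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return sidecar


def find_runs(root: Path, **query) -> list:
    """Return run directories whose sidecar matches every key/value in the query."""
    hits = []
    for sidecar in root.rglob("metadata.json"):
        meta = json.loads(sidecar.read_text())
        if all(meta.get(k) == v for k, v in query.items()):
            hits.append(sidecar.parent)
    return hits
```

The nice thing is that such key/value pairs map straight onto iRODS attribute/
value/unit triples later on, so starting simple loses nothing.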

All the best from a wet London


Am Donnerstag, 4. Februar 2021, 10:35:22 GMT schrieb Tim Cutts:
> Compute capacity is not generally the issue.  For this pipeline, we only
> need about 200 cores to keep up with each sequencer, so a couple of
> servers.   Genomics has not, historically, been a good fit for SETI at home
> style cycle-stealing, because the amount of compute you perform on a given
> unit of data is quite low.  A lot of genomics is already I/O bound even
> when the compute is right next to the data, so you don’t gain much by
> shipping it off to cycle-stealing desktops.
> In fact, the direction most sequencing instrument suppliers are going is
> embedding the compute in the sequencer itself, at least for use cases where
> you don’t really need the sequence at all, you just need to know how it
> varies from a reference genome.  In such cases, it’s much more sensible to
> run the pipeline on or right next to the sequencer and just spit out the
> (very small) diffs.
> Scientists are conservative folks though, they sometimes get a bit nervous
> at the thought of discarding the raw sequence data.
> Tim
> On 4 Feb 2021, at 10:27, Jonathan Aquilina
> <jaquilina at eagleeyet.net<mailto:jaquilina at eagleeyet.net>> wrote:
> Would love to help you guys out in anyway i can in terms of hardware
> processing.
> Have you guys thought of doing something like SETI at home and those projects
> to get idle compute power to help churn through the massive amounts of
> data?
> Regards,
> Jonathan
> ________________________________
> From: Tim Cutts <tjrc at sanger.ac.uk<mailto:tjrc at sanger.ac.uk>>
> Sent: 04 February 2021 11:26
> To: Jonathan Aquilina
> <jaquilina at eagleeyet.net<mailto:jaquilina at eagleeyet.net>>
> Cc: Beowulf
> <beowulf at beowulf.org<mailto:beowulf at beowulf.org>>
> Subject: Re: [Beowulf] Project Heron at the Sanger Institute [EXT]
> On 4 Feb 2021, at 10:14, Jonathan Aquilina via Beowulf
> <beowulf at beowulf.org<mailto:beowulf at beowulf.org>> wrote:
> I am curious though to chunk out such large data is something like
> hadoop/HBase and the like of those platforms, are those whats being used?
> It’s a combination of our home-grown sequencing pipeline which we use across
> the board, and then a specific COG-UK analysis of the genomes themselves. 
> This pipeline is common to all consortium members who are contributing
> sequence data.  It’s a Nextflow pipeline, and the code is here:
> https://github.com/connor-lab/ncov2019-artic-nf
> Being nextflow, you can run it on anything for which nextflow has a backend
> scheduler.   It supports data from both Illumina and Oxford Nanopore
> sequencers.
> Tim
> -- The Wellcome Sanger Institute is operated by Genome Research Limited, a
> charity registered in England with number 1021457 and a company registered
> in England with number 2742969, whose registered office is 215 Euston Road,
> London, NW1 2BE.
