[Beowulf] Project Heron at the Sanger Institute [EXT]

Tim Cutts tjrc at sanger.ac.uk
Thu Feb 4 10:21:27 UTC 2021

> On 3 Feb 2021, at 18:23, Jörg Saßmannshausen <sassy-work at sassy.formativ.net> wrote:
> Hi John,
> interesting stuff and good reading. 
> For the IT interests on here: these sequencing machines are chucking out large 
> amounts of data per day. The project I am involved in can chew out 400 GB or so 
> of raw data per day. That is a small machine. That then needs to be processed 
> before you can actually analyze it. So there is quite some data movement etc. 
> involved here. 

If anyone wants any details, just ask me, since the IT supporting all that sequencing is my team’s baby.

Actually, the sequencing capacity needed for this volume of COVID samples is not great.  The virus genome is so small (only 30,000 bases, compared to a human’s 3 billion base pairs) that you can massively multiplex the samples in a single sequencing run.
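
To put a rough number on that size difference, here is a minimal Python sketch using only the two figures quoted above. The variable names are illustrative, and the ratio is just a size comparison: it is not the achievable multiplexing factor, which also depends on things like the coverage depth wanted per sample.

```python
# Genome sizes as quoted in the text (both are round figures).
VIRUS_GENOME_BASES = 30_000
HUMAN_GENOME_BASE_PAIRS = 3_000_000_000

# How many virus-sized genomes fit into one human-sized genome.
size_ratio = HUMAN_GENOME_BASE_PAIRS // VIRUS_GENOME_BASES

print(f"A human genome is about {size_ratio:,} times larger")
```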

Currently, we multiplex 384 samples per NovaSeq sequencing lane.  There are four lanes per flowcell, and two flowcells per sequencer.  The sequencing run takes about 24 hours, so each instrument can sequence about 3,000 samples per day (384 × 4 × 2 = 3,072).
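
The per-instrument figure can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `samples_per_run` is made up for illustration, and equating one run with one day is an assumption based on the "about 24 hours" run time.

```python
def samples_per_run(samples_per_lane=384, lanes_per_flowcell=4,
                    flowcells_per_sequencer=2):
    """Samples one sequencer handles in a single run (roughly one day)."""
    return samples_per_lane * lanes_per_flowcell * flowcells_per_sequencer


per_run = samples_per_run()
print(f"{per_run} samples per sequencer per run")
```

Changing `samples_per_lane` shows directly how a higher multiplexing level would scale the per-instrument throughput.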

We have about 20 of these sequencers, so our total capacity is very high; in fact we only use three sequencers for COVID at the moment, because sample and library preparation (getting those 384 samples ready for the sequencer) is actually the bottleneck.  We are planning to increase throughput though, both by increasing multiplexing and by using more sequencers.
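
For a feel of the headroom, the same arithmetic can be scaled up to several instruments. This is a back-of-the-envelope sketch, not a real capacity model: `fleet_capacity` is a hypothetical helper, it assumes every sequencer is fully loaded with one run per day, and it ignores the sample and library preparation bottleneck entirely.

```python
# Figures quoted in the text.
SAMPLES_PER_LANE = 384
LANES_PER_FLOWCELL = 4
FLOWCELLS_PER_SEQUENCER = 2


def fleet_capacity(n_sequencers, runs_per_day=1):
    """Upper bound on samples per day for n fully loaded sequencers."""
    per_run = SAMPLES_PER_LANE * LANES_PER_FLOWCELL * FLOWCELLS_PER_SEQUENCER
    return n_sequencers * per_run * runs_per_day


covid_now = fleet_capacity(3)     # the three sequencers used for COVID
whole_fleet = fleet_capacity(20)  # "about 20" sequencers in total

print(f"3 sequencers:  up to {covid_now:,} samples/day")
print(f"20 sequencers: up to {whole_fleet:,} samples/day")
```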

Sequencing itself takes a bit less than a day, and the computational analysis to de-multiplex and reconstruct the genomes takes less than a day on our production-oriented OpenStack cluster (we keep critical projects like Heron on a physically separate cluster from normal faculty research), so we can easily keep up with the sequencers.  We then upload our results to the folks at CLIMB, and that’s where the comparative genomics tends to take place.

There’s a lot of effort at the moment going into speeding up the end-to-end process; for this sequencing to be as useful as possible for close-to-real-time outbreak and mutation analysis, the turnaround time needs to be as short as possible.  It turns out you can see statistically significant new mutation signatures very early on before infection rates really start to rise (this was visible in Kent data for B.1.1.7), so the sooner we can see this sort of thing the better we will get at taking appropriate measures.

For more details on the actual analysis, we released a public seminar a couple of weeks ago:



