[Beowulf] Southampton's RPi cluster is cool but too many cables?

Tue Sep 25 09:52:08 PDT 2012

On 09/25/2012 12:33 PM, Igor Kozin wrote:
> "this thing" does only ~ 1/20 of the genome. you have to pay quite a
> bit more for your full genome which makes it comparable (price-wise)
> with other technologies. hopefully in a few years time it'll get
> cheaper.
> stored as characters (1 byte per char) the genome is ~ 3 GB. you could
> use two bits to represent the four letter alphabet but probably nobody
> does that. better yet, you can store only the difference against a
> known reference genome or use a dictionary based compression. however
> people like having metadata (e.g. quality) as well so the size varies.
>
> the current breed of sequencers which produce many short reads (70-200
> base pairs) with high coverage require lots of intermediate data
> flying around. indeed a single run can result in ~ 0.5 TB. if you have
> multiple sequencers you better have a lot of storage.
>
> single-strand sequencing technologies promise to be more accurate and
> have long reads. still you don't want to wait a very long time for
> sequencing accurately a single strand so you chop it and do parallel
> processing. a laptop should be able to cope with post processing.
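
To make the storage arithmetic above concrete, here is a minimal Python
sketch of the two-bit packing idea. The helper names (pack, unpack,
packed_size_bytes) are hypothetical, and it assumes the sequence contains
only A, C, G and T, with no N or other ambiguity codes and no quality
metadata, so it is an illustration rather than a usable file format.

```python
# Hypothetical helpers illustrating 2 bits per base.
# Assumption: input contains only A, C, G, T (real data often has N etc.).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string of bases into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Base i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n):
    """Recover the first n bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


def packed_size_bytes(n_bases):
    """Bytes needed to hold n_bases at 2 bits each, rounded up."""
    return (n_bases + 3) // 4


if __name__ == "__main__":
    packed = pack("GATTACA")
    print(len(packed), unpack(packed, 7))
    # ~3 billion bases: 1 byte per char vs 2 bits per base
    print(packed_size_bytes(3_000_000_000))
```

The length has to be stored separately because the last byte may be only
partly used. At 2 bits per base, the ~3 GB figure drops to roughly a
quarter of that.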

Great info, thanks Igor.  My apologies for using an abstraction ("this
thing") in place of its proper name.  I was writing tersely, as I had a
lot going on at the time.

I do wonder, however, whether the reported 7500 bases per second number
is misleading marketing (which would cut your 1/20th figure to 1/40th).
With a diploid human genome, I would expect it to spend half of its time
looking at sequence it has probably already seen.  Maybe there is some
way to filter out the duplicates?
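
A rough back-of-the-envelope version of that reasoning, as a Python
sketch. The assumptions are mine: a haploid genome of ~3 billion bases
(matching the ~3 GB figure quoted above), a sustained 7500 bases per
second with no overhead, and a device output equal to 1/20 of the haploid
genome. The function names are hypothetical.

```python
# Assumptions (not vendor figures): ~3e9 bases per haploid genome,
# a constant 7500 bases/second, single-pass (1x) coverage.
HAPLOID_BASES = 3_000_000_000
RATE = 7500  # bases per second, the reported number


def seconds_for(n_bases, rate=RATE):
    """Time to read n_bases at a constant rate."""
    return n_bases / rate


def fraction_covered(bases_read, genome_bases):
    """Share of a genome covered by bases_read at 1x coverage."""
    return bases_read / genome_bases


if __name__ == "__main__":
    # Output implied by the 1/20 figure, taken against the haploid size.
    bases_read = HAPLOID_BASES // 20
    print(seconds_for(HAPLOID_BASES) / 86400)           # days for one haploid pass
    print(fraction_covered(bases_read, HAPLOID_BASES))      # haploid denominator
    print(fraction_covered(bases_read, 2 * HAPLOID_BASES))  # diploid denominator
```

Counting both copies doubles the denominator, which is where the 1/40th
comes from. Whether the second copy counts as wasted time depends on
whether you care about the differences between the two.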

Best,

ellis