[Beowulf] Advice on 4 CPU node configurations

David Mathog mathog at caltech.edu
Thu Feb 10 12:21:54 PST 2011


Christopher Samuel wrote:
> I've heard 1TB of RAM quoted by the bioinformatics people
> as the amount of RAM needed to do de-novo reassembly of
> the human genome with Velvet (which is a single threaded
> application).

That would be an expensive way to do it, though.  It is much more
efficient to paint the new reads onto the existing consensus and then
go back and deal with any discrepancies.

Anyway, the Velvet memory usage estimator is described here:

http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-July/000474.html

Velvet is for assembling from short reads.  Short reads are easy to get
in large numbers but are cruddy for fully assembling large genomes de
novo, since genomes are full of high-copy DNA longer than the reads.  So
the de novo sequence would come out in a lot of disconnected chunks
which would have to be mapped back onto the reference sequence anyway.

Assuming 100 bp reads, a genome size of 3000 Mb (the human genome,
rounded down to the nearest billion bases), a hash length k of 31, and
20X over-sequencing, so that numreads = genome size / read size * 20 /
1000000 (reads in millions, here 600), plugging into that formula gives:

 -109635 + 18977*100 + 86326*3000
   + 233353*(3000000000/100)*20/1000000 - 51092*31

        = 399,194,013 KB
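
For the record, here is that calculation as a short Python sketch.  The
coefficients come from the velvet-users regression linked above; the
function name and parameter names are mine:

    def velvet_mem_kb(read_size, genome_mb, numreads_m, k):
        # Estimated velvetg memory in KB, per the velvet-users post.
        # read_size:  read length in bases
        # genome_mb:  genome size in megabases
        # numreads_m: number of reads, in millions
        # k:          hash (k-mer) length
        return (-109635 + 18977 * read_size + 86326 * genome_mb
                + 233353 * numreads_m - 51092 * k)

    genome_bases = 3000 * 10**6                                # 3000 Mb
    read_size, coverage, k = 100, 20, 31
    numreads_m = genome_bases / read_size * coverage / 10**6   # 600 million
    print(velvet_mem_kb(read_size, 3000, numreads_m, k))       # 399194013.0 KB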

Roughly 381 GB (if I didn't typo anything).  Most of that is in the
hashes, where each base from the interior of a read is included in 31
different hashes, once in each position 1->31 (the hash is calculated
on a sliding window that increments by 1, not by 31).  Effectively the
hash takes the 60 Gbases of raw sequence and expands it roughly 31X
(a little less in practice, since bases near the ends of a read fall
in fewer windows).
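
A toy illustration of that windowing (plain Python, not Velvet code):
a read of length L contributes L - k + 1 overlapping k-mers, and a base
in the interior of the read lands in k of them:

    def kmers(read, k=31):
        # All overlapping k-mers of a read; the window slides by 1 base.
        return [read[i:i + k] for i in range(len(read) - k + 1)]

    read = "ACGT" * 25                  # one 100 bp read
    windows = kmers(read, 31)
    print(len(windows))                 # 70 k-mers from a 100 bp read
    # The base at (0-based) position 50 falls in windows 20..50 inclusive:
    print(sum(1 for i in range(len(windows)) if i <= 50 <= i + 30))  # 31

For these 100 bp reads that is 70 * 31 = 2170 bases of hash input per
100 bases of read, about 22X; the factor approaches 31X as reads get
long relative to k.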

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


