[Beowulf] Large amounts of data to store and process

Mon Mar 4 08:39:25 PST 2019

On Mon, Mar 4, 2019 at 8:18 AM Jonathan Aquilina
<jaquilina at eagleeyet.net> wrote:
>
> As previously mentioned, we don’t really need to have anything indexed, so I am thinking flat files are the way to go. My only concern is the performance of large flat files.

potentially. there are many factors in the workflow that ultimately
influence the decision, as others have pointed out.  my flat file
example is only one case, where we just repeatedly blow through the files.
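
to make "blow through the files" concrete, here is a minimal sketch of a
sequential pass over a flat file in fixed-size chunks.  the function name
scan_flat_file, the 64 MiB chunk size, and the newline count are all
assumptions for illustration; the real per-chunk work depends on the app.

```python
def scan_flat_file(path, chunk_size=64 * 1024 * 1024):
    """Read a file front to back in fixed-size chunks.

    Returns (total_bytes, newline_count). The newline count is only a
    stand-in for whatever per-chunk processing the real job would do.
    """
    total_bytes = 0
    newlines = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total_bytes += len(chunk)
            newlines += chunk.count(b"\n")
    return total_bytes, newlines
```

memory use stays bounded by the chunk size no matter how big the file is,
so a pass like this is usually limited by how fast the storage can
deliver bytes rather than by the file size itself.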

> Isn't that what HDFS is for, to deal with large flat files?

large is relative.  a 256GB file isn't "large" anymore.  i've pushed TB
files through hadoop and run the terabyte sort benchmark, and yes, it
can be done on a time scale of minutes, but you need an astounding amount
of hardware to do it (in the last benchmark paper i saw, it was something
like 1000 nodes).  you can accomplish the same feat using less hardware
and less complicated hardware/software.
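
a back-of-envelope sketch of what "a terabyte in minutes" implies for
throughput.  the helper name required_throughput and the example numbers
(10**12 bytes, 120 seconds, 1000 nodes) are assumptions picked for
illustration, not figures from any benchmark paper, and the result only
covers one pass over the data; a sort reads and writes it more than once.

```python
def required_throughput(size_bytes, seconds, nodes):
    """Return (aggregate, per_node) bytes/second for one pass over the data.

    All three inputs are hypothetical planning numbers supplied by the caller.
    """
    if size_bytes <= 0 or seconds <= 0 or nodes <= 0:
        raise ValueError("size_bytes, seconds and nodes must be positive")
    aggregate = size_bytes / seconds   # whole-cluster rate
    per_node = aggregate / nodes       # even split across nodes
    return aggregate, per_node


if __name__ == "__main__":
    agg, per = required_throughput(10**12, 120, 1000)
    print(f"aggregate: {agg / 1e9:.2f} GB/s, per node: {per / 1e6:.2f} MB/s")
```

plugging in your own data size, time budget, and node count gives a quick
feel for whether a handful of machines with fast storage could cover the
same ground.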

and if your devs aren't willing to adapt to the hadoop ecosystem, you're
sunk right off the dock.

to get a more targeted answer from the numerous smart people on the
list, you'd need to open up the app and workflow to us.  there are just
too many variables.