[Beowulf] Large amounts of data to store and process

Jonathan Engwall engwalljonathanthereal at gmail.com
Mon Mar 4 06:10:47 PST 2019


What does your overall design look like?

On Mon, Mar 4, 2019, 5:19 AM Jonathan Aquilina <jaquilina at eagleeyet.net>
wrote:

> Hi Michael,
>
> As previously mentioned, we don't really need anything indexed, so I am
> thinking flat files are the way to go; my only concern is the performance
> of large flat files. Isn't that what HDFS is for, dealing with large flat
> files?
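
A minimal sketch of the kind of end-to-end scan under discussion here; the
file name data/records.csv and the column layout are illustrative
assumptions, not details from this thread. Sequential scans like this are
the access pattern flat files handle well:

    # Stream a large flat file sequentially without loading it into memory.
    # Assumptions (hypothetical): newline-delimited CSV at data/records.csv
    # with a numeric measurement in column 1.
    import csv

    total, count = 0.0, 0
    with open("data/records.csv", newline="") as fh:
        for row in csv.reader(fh):      # buffered, row-at-a-time read
            total += float(row[1])
            count += 1
    print("mean:", total / count if count else float("nan"))
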
>
> On 04/03/2019, 14:13, "Beowulf on behalf of Michael Di Domenico" <
> beowulf-bounces at beowulf.org on behalf of mdidomenico4 at gmail.com> wrote:
>
>     Even though you've alluded to this being time-series data: is there a
>     requirement to index into the data, or do you just read the data
>     end-to-end and do some calculations?
>
>     I routinely face these kinds of issues, but we're not indexing into
>     the data, so having things in HDFS or an RDBMS doesn't give us any
>     benefit.  We pull all the data into organized flat files and blow
>     through them with HTCondor.  If a researcher wants to tweak the code,
>     they do, and then just rerun the whole simulation.
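
A hedged sketch of what that fan-out can look like as an HTCondor submit
description; analyze.py, the data/*.dat layout, and the log paths are
illustrative, not taken from the thread:

    # One job per flat file; HTCondor expands $(datafile) from the glob in
    # the queue statement.  All file names here are hypothetical.
    universe              = vanilla
    executable            = analyze.py
    arguments             = $(datafile)
    output                = logs/$(Cluster).$(Process).out
    error                 = logs/$(Cluster).$(Process).err
    log                   = analyze.log
    should_transfer_files = YES
    transfer_input_files  = $(datafile)
    queue datafile matching files data/*.dat

Each matched file becomes one queued job, so rerunning the whole
simulation after a code tweak is just resubmitting this one file.
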
>
>     Sometimes that's minutes, sometimes days.  But in either case the
>     time to develop the code is always much shorter, because the data is
>     in flat files and easier for my "non-programmer" programmers.  There
>     is no need to learn HDFS/Hadoop or SQL.
>
>     If you need to index the data and jump around, HDFS is probably still
>     not the best solution unless you want to index the files yourself,
>     and 250 GB isn't really big enough to warrant an HDFS cluster.  I've
>     generally found that unless you're dealing with multi-TB+ datasets,
>     you can't scale the hardware out enough to get the speed-up.  (Yes, I
>     know there are tweaks to change this, but I've found it's just
>     simpler to buy a bigger Lustre system.)
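
Back-of-envelope arithmetic behind that threshold, assuming roughly 2 GB/s
of sustained sequential read from local NVMe or a small Lustre stripe (an
illustrative figure, not one from the thread):

    # Rough scan time for a 250 GB dataset at an assumed 2 GB/s.
    dataset_gb = 250
    bandwidth_gb_s = 2.0                        # assumed sustained read rate
    print(dataset_gb / bandwidth_gb_s / 60.0)   # ~2.1 minutes on one node

At that rate a single well-provisioned node scans the whole dataset in a
couple of minutes, comfortably inside a 10-minute budget, which is why
scaling out only starts to pay off at multi-TB sizes.
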
>
>
>
>     On Mon, Mar 4, 2019 at 1:39 AM Jonathan Aquilina
>     <jaquilina at eagleeyet.net> wrote:
>     >
>     > Good Morning all,
>     >
>     >
>     >
>     > I am working on a project that I sadly can't go into in much detail,
>     > but this system will ingest quite large amounts of data, and the
>     > results would need to be returned efficiently to the end user in
>     > around 10 minutes or so. I am in discussions with another partner
>     > involved in this project about the best way forward.
>     >
>     >
>     >
>     > For me, given the amount of data (and it is a huge amount of data),
>     > an RDBMS such as PostgreSQL would be a major bottleneck. Another
>     > option that was considered is flat files, and I think the best fit
>     > for those would be a Hadoop cluster with HDFS. But in the case of
>     > HPC, how can such an environment help with ingesting and analyzing
>     > large amounts of data? Would said flat files be put on a SAN/NAS or
>     > something similar and accessed through an NFS share for
>     > computational purposes?
>     >
>     >
>     >
>     > Regards,
>     >
>     > Jonathan
>     >