[Beowulf] Large amounts of data to store and process

John Hearns hearnsj at googlemail.com
Tue Mar 5 06:07:42 PST 2019


Talking about missing values... Joe Landman is sure to school me again
for this one (owwwccchhh), but Julia has a dedicated 'missing' value:
https://docs.julialang.org/en/v1/manual/missing/index.html

Going back to the hardware, a 250 GByte data set is not too large to hold
in RAM.
This might be a good use case for Intel Optane persistent memory - I don't
know exactly how this works when used in memory mode as opposed to
block device mode.
The Diablo memory was supposed to migrate cold pages down to the lower,
slower memory.
Does Optane function similarly?





On Tue, 5 Mar 2019 at 01:02, Lux, Jim (337K) via Beowulf <
beowulf at beowulf.org> wrote:

> I'm munging through not very much satellite telemetry (a few GByte), using
> sqlite3.
> Here are some general observations:
> 1) If the data is recorded by multiple sensor systems, the clocks will
> *not* align - sure, they may run NTP, but...
> 2) Typically there's some sort of raw clock being recorded with the data
> (in ticks of some oscillator) - that's what you can use to put
> data from a particular batch of sources into time order.  And then you
> have the problem of reconciling the different clocks.
> 3) Watch out for leap seconds in time stamps - some systems have them
> (UTC), some do not (GPS, TAI) - a time of 23:59:60 may be legal.
> 4) You need a way to deal with "missing" data, whether it's time
> tags or actual measurements - as well as "gaps in the record".
> 5) Be aware of the need to de-dupe data - the same telemetry records can
> arrive from multiple sources.
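
A minimal Python sketch of points 2, 4 and 5 with the sqlite3 module. None of
the names below come from Jim's mail: the `telemetry` table, its columns, the
(sensor, tick) key, and the fixed expected tick step are all assumptions made
for illustration. Rows are ordered by the raw tick count rather than by wall
clock, a NULL value stands in for a missing measurement, and a UNIQUE
constraint lets SQLite drop a record that arrives a second time.

```python
import sqlite3

# Hypothetical schema: one row per (sensor, raw tick). The UNIQUE constraint
# is what lets SQLite discard a record that arrives again via another source.
SCHEMA = """
CREATE TABLE IF NOT EXISTS telemetry (
    sensor TEXT    NOT NULL,
    tick   INTEGER NOT NULL,   -- raw oscillator count, not wall-clock time
    value  REAL,               -- NULL means the measurement is missing
    UNIQUE (sensor, tick)
)
"""


def load(conn, records):
    """Insert (sensor, tick, value) tuples, silently skipping duplicates."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO telemetry (sensor, tick, value) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()


def find_gaps(conn, sensor, expected_step):
    """Return (last_tick_before, first_tick_after) pairs around each gap."""
    ticks = [
        row[0]
        for row in conn.execute(
            "SELECT tick FROM telemetry WHERE sensor = ? ORDER BY tick",
            (sensor,),
        )
    ]
    # A gap is any pair of consecutive ticks further apart than expected.
    return [(a, b) for a, b in zip(ticks, ticks[1:]) if b - a > expected_step]


def count_missing(conn, sensor):
    """Count rows whose time tag exists but whose measurement is NULL."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM telemetry WHERE sensor = ? AND value IS NULL",
        (sensor,),
    ).fetchone()
    return n
```

With an in-memory database (`sqlite3.connect(":memory:")`), loading the same
(sensor, tick) twice leaves a single row. Note that INSERT OR IGNORE keeps
whichever copy arrived first, so if a later copy carries a value that the
first one lacked, a different conflict policy would be needed. Reconciling
ticks from different oscillators is a separate problem that this sketch does
not attempt.
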
>
>
> Jim Lux
> (818)354-2075 (office)
> (818)395-2714 (cell)
>
>
> -----Original Message-----
> From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of Jonathan
> Aquilina
> Sent: Monday, March 04, 2019 1:24 AM
> To: Fred Youhanaie <fly at anydata.co.uk>; beowulf at beowulf.org
> Subject: Re: [Beowulf] Large amounts of data to store and process
>
> Hi Fred,
>
> My colleague and I did some research and found an extension for
> PostgreSQL called TimescaleDB, but upon further research PostgreSQL on
> its own is good for such data as well. The thing is, the data is not going
> to be given to us as it comes in, but in bulk at the end from the
> parent company.
>
> Have you used PostgreSQL for such types of data, and how has it performed?
>
> Regards,
> Jonathan
>
> On 04/03/2019, 10:19, "Beowulf on behalf of Fred Youhanaie" <
> beowulf-bounces at beowulf.org on behalf of fly at anydata.co.uk> wrote:
>
>     Hi Jonathan,
>
>     It seems you're collecting metrics and time series data. Perhaps a
>     time series database (TSDB) is an option for you. There are a few of
>     these out there, but I don't have any personal recommendation.
>
>     Cheers,
>     Fred
>
>     On 04/03/2019 07:04, Jonathan Aquilina wrote:
>     > These would be numerical data such as integers or floating point
>     > numbers.
>     >
>     > -----Original Message-----
>     > From: Tony Brian Albers <tba at kb.dk>
>     > Sent: 04 March 2019 08:04
>     > To: beowulf at beowulf.org; Jonathan Aquilina <jaquilina at eagleeyet.net>
>     > Subject: Re: [Beowulf] Large amounts of data to store and process
>     >
>     > Hi Jonathan,
>     >
>     > From my limited knowledge of the technologies, I would say that
>     > HBase with file pointers to the files placed on HDFS would suit you
>     > well.
>     >
>     > But if the files are log files, consider some tools that are suited
>     > for analyzing those like Kibana.
>     >
>     > /tony
>     >
>     >
>     > On Mon, 2019-03-04 at 06:55 +0000, Jonathan Aquilina wrote:
>     >> Hi Tony,
>     >>
>     >> Sadly I can't go into much detail as I am under an NDA. At this
>     >> point, with the prototype, we have around 250 GB of sample data, but
>     >> again this depends on the type of aircraft. Larger aircraft and
>     >> longer flights will generate a lot more data, as they have more
>     >> sensors and will log more data than the sample data that I have. The
>     >> sample data is 250 GB for 35 aircraft of the same type.
>     >>
>     >> Regards,
>     >> Jonathan
>     >>
>     >> -----Original Message-----
>     >> From: Tony Brian Albers <tba at kb.dk>
>     >> Sent: 04 March 2019 07:48
>     >> To: beowulf at beowulf.org; Jonathan Aquilina <jaquilina at eagleeyet.net>
>     >> Subject: Re: [Beowulf] Large amounts of data to store and process
>     >>
>     >> On Mon, 2019-03-04 at 06:38 +0000, Jonathan Aquilina wrote:
>     >>> Good Morning all,
>     >>>
>     >>> I am working on a project that I sadly can't go into much detail
>     >>> about, but there will be quite large amounts of data that will be
>     >>> ingested by this system and would need to be efficiently returned as
>     >>> output to the end user in around 10 min or so. I am in discussions
>     >>> with another partner involved in this project about the best way
>     >>> forward on this.
>     >>>
>     >>> My view is that, given the amount of data (and it is a huge amount
>     >>> of data), an RDBMS such as PostgreSQL would be a major bottleneck.
>     >>> Another option that was considered is flat files, and I think the
>     >>> best fit for that would be a Hadoop cluster with HDFS. But in the
>     >>> case of HPC, how can such an environment help in terms of ingesting
>     >>> and analysing large amounts of data? Would said flat files be put on
>     >>> a SAN/NAS or something and accessed through an NFS share for
>     >>> computational purposes?
>     >>>
>     >>> Regards,
>     >>> Jonathan
>     >>> _______________________________________________
>     >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>     >>> To change your subscription (digest mode or unsubscribe) visit
>     >>> http://www.beowulf.org/mailman/listinfo/beowulf
>     >>
>     >> Good morning,
>     >>
>     >> Around here, we're using HBase for similar purposes. We have a bunch
>     >> of smaller nodes storing the data, and all the management nodes
>     >> (Ambari, HDFS namenodes etc.) are VMs.
>     >>
>     >> Our nodes are configured so that we have a maximum of 2 cores per
>     >> disk spindle and 4G of memory for each core. This seems to do the
>     >> trick and is pretty responsive.
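
Tony's rule of thumb (at most 2 cores per disk spindle, 4G of memory per
core) is simple arithmetic, sketched below. The function name and parameters
are hypothetical, and reading "4G" as 4 GiB is an assumption.

```python
def node_limits(spindles, cores_per_spindle=2, gib_per_core=4):
    """Apply the quoted rule of thumb to one data node.

    Returns (max_cores, memory_gib) for a node with `spindles` disks.
    """
    if spindles < 1:
        raise ValueError("a data node needs at least one disk spindle")
    max_cores = spindles * cores_per_spindle
    return max_cores, max_cores * gib_per_core
```

For example, `node_limits(12)` gives a ceiling of 24 cores and 96 GiB for a
node with 12 spindles.
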
>     >>
>     >> But to be able to provide better advice, you will probably need to
>     >> go into a bit more detail about what types of data you will be
>     >> storing and which kind of calculations you want to perform.
>     >>
>     >> /tony
>     >>
>     >>
>     >> --
>     >> Tony Albers - Systems Architect - IT Development
>     >> Royal Danish Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark
>     >> Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142
>     >