[Beowulf] Large amounts of data to store and process
John Hearns
hearnsj at googlemail.com
Tue Mar 5 06:07:42 PST 2019
Talking about missing values... Joe Landman is sure to school me again
for this one (owwwccchhh)
https://docs.julialang.org/en/v1/manual/missing/index.html
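Python's pandas behaves in a roughly analogous way, for those of us not
yet converted - a minimal sketch, values made up:

    import pandas as pd

    # Hypothetical telemetry channel with gaps; None becomes NaN
    alt = pd.Series([100.0, None, 110.0, None, 120.0])

    print(alt.isna().sum())   # 2 missing readings
    print(alt.mean())         # 110.0 - skipna by default, gaps ignored
    print(alt.interpolate())  # linear fill between known samples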
Going back to the hardware, a 250 GB data set is not too large to hold
in RAM.
This might be a good use case for Intel Optane persistent memory - I don't
know exactly how it behaves when used in memory mode as opposed to
block device mode.
The Diablo memory was supposed to migrate cold pages down to the lower,
slower memory.
Does Optane function similarly?
On Tue, 5 Mar 2019 at 01:02, Lux, Jim (337K) via Beowulf <
beowulf at beowulf.org> wrote:
> I'm munging through not very much satellite telemetry (a few GB), using
> sqlite3.
> Here are some general observations:
> 1) if the data is recorded by multiple sensor systems, the clocks will
> *not* align - sure they may run NTP, but....
> 2) Typically there's some sort of raw clock being recorded with the data
> (in ticks of some oscillator, typically) - that's what you can use to put
> data from a particular batch of sources into a time order. And then you
> have the problem of reconciling the different clocks.
> 3) Watch out for leap seconds in time stamps - some systems have them
> (UTC), some do not (GPS, TAI) - a time of 23:59:60 may be legal.
> 4) you need to have a way to deal with "missing" data, whether it's time
> tags, or actual measurements - as well as "gaps in the record"
> 5) Be aware of the need to de-dupe data - the same telemetry records can
> arrive from multiple sources. A rough sqlite3 sketch of 2), 4) and 5)
> follows below.
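>
> Untested, with the table layout and tick rate invented for illustration:
>
>     import sqlite3
>
>     TICK_HZ = 65536  # assumed oscillator rate - yours will differ
>
>     con = sqlite3.connect("telemetry.db")
>     con.execute("""CREATE TABLE IF NOT EXISTS tlm (
>                        source TEXT,     -- which sensor system
>                        ticks  INTEGER,  -- raw clock count, per 2)
>                        value  REAL,     -- NULL marks missing, per 4)
>                        PRIMARY KEY (source, ticks))""")
>
>     # INSERT OR IGNORE silently drops an exact repeat of a record we
>     # already hold - crude de-dupe, per 5)
>     con.execute("INSERT OR IGNORE INTO tlm VALUES (?, ?, ?)",
>                 ("imu1", 123456, 42.0))
>     con.commit()
>
>     # Sort on the raw tick count, convert to seconds only at the end
>     for source, ticks, value in con.execute(
>             "SELECT source, ticks, value FROM tlm "
>             "ORDER BY source, ticks"):
>         print(source, ticks / TICK_HZ, value)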
>
>
> Jim Lux
> (818)354-2075 (office)
> (818)395-2714 (cell)
>
>
> -----Original Message-----
> From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of Jonathan
> Aquilina
> Sent: Monday, March 04, 2019 1:24 AM
> To: Fred Youhanaie <fly at anydata.co.uk>; beowulf at beowulf.org
> Subject: Re: [Beowulf] Large amounts of data to store and process
>
> Hi Fred,
>
> My colleague and I did some research and found an extension for
> PostgreSQL called TimescaleDB, but upon further research Postgres on
> its own is good for such data as well. The thing is, the data is not
> going to be given to us as it comes in, but in bulk at the end from the
> parent company.
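>
> From what we have read, the basic TimescaleDB pattern is roughly the
> following (schema and DSN invented for illustration; we have not run
> this ourselves yet):
>
>     import psycopg2
>
>     con = psycopg2.connect("dbname=telemetry")
>     cur = con.cursor()
>     cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb")
>     cur.execute("""CREATE TABLE readings (
>                        ts     TIMESTAMPTZ NOT NULL,
>                        tail   TEXT,   -- aircraft identifier
>                        sensor TEXT,
>                        value  DOUBLE PRECISION)""")
>     # TimescaleDB's one extra step: partition the table by time
>     cur.execute("SELECT create_hypertable('readings', 'ts')")
>     con.commit()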
>
> Have you used PostgreSQL for these types of data, and how has it performed?
>
> Regards,
> Jonathan
>
> On 04/03/2019, 10:19, "Beowulf on behalf of Fred Youhanaie" <
> beowulf-bounces at beowulf.org on behalf of fly at anydata.co.uk> wrote:
>
> Hi Jonathan,
>
> It seems you're collecting metrics and time series data. Perhaps a
> time series database (TSDB) is an option for you. There are a few of these
> out there, but I don't have any personal recommendation.
>
> Cheers,
> Fred
>
> On 04/03/2019 07:04, Jonathan Aquilina wrote:
> > These would be numerical data such as integers or floating point
> > numbers.
> >
> > -----Original Message-----
> > From: Tony Brian Albers <tba at kb.dk>
> > Sent: 04 March 2019 08:04
> > To: beowulf at beowulf.org; Jonathan Aquilina <jaquilina at eagleeyet.net>
> > Subject: Re: [Beowulf] Large amounts of data to store and process
> >
> > Hi Jonathan,
> >
> > From my limited knowledge of the technologies, I would say that HBase
> > holding pointers to the files placed on HDFS would suit you well.
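> >
> > The pointer pattern is just one small cell per file - something like
> > this happybase sketch (host, table and paths invented; assumes a
> > table with column family 'f' already exists, and it's untested):
> >
> >     import happybase
> >
> >     conn = happybase.Connection("hbase-master")
> >     files = conn.table("flightfiles")
> >
> >     # Store only a pointer; the bulk bytes stay on HDFS
> >     files.put(b"tail42-20190304", {
> >         b"f:hdfs_path": b"hdfs:///telemetry/tail42/20190304.dat",
> >         b"f:size": b"7340032",
> >     })
> >
> >     print(files.row(b"tail42-20190304")[b"f:hdfs_path"])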
> >
> > But if the files are log files, consider tools that are suited to
> > analyzing those, such as Kibana.
> >
> > /tony
> >
> >
> > On Mon, 2019-03-04 at 06:55 +0000, Jonathan Aquilina wrote:
> >> Hi Tony,
> >>
> >> Sadly I can't go into much detail as I'm under an NDA. At this point
> >> we have around 250 GB of sample data for the prototype, but again
> >> the volume depends on the type of aircraft. Larger aircraft and
> >> longer flights will generate a lot more data, as they have more
> >> sensors and will log more than the sample data that I have. The
> >> sample data is 250 GB for 35 aircraft of the same type.
> >>
> >> Regards,
> >> Jonathan
> >>
> >> -----Original Message-----
> >> From: Tony Brian Albers <tba at kb.dk>
> >> Sent: 04 March 2019 07:48
> >> To: beowulf at beowulf.org; Jonathan Aquilina <jaquilina at eagleeyet.net>
> >> Subject: Re: [Beowulf] Large amounts of data to store and process
> >>
> >> On Mon, 2019-03-04 at 06:38 +0000, Jonathan Aquilina wrote:
> >>> Good Morning all,
> >>>
> >>> I am working on a project that I sadly can't go into much detail
> >>> about, but quite large amounts of data will be ingested by this
> >>> system and would need to be efficiently returned as output to the
> >>> end user in around 10 minutes or so. I am in discussions with
> >>> another partner involved in this project about the best way
> >>> forward.
> >>>
> >>> For me, given the amount of data (and it is a huge amount of data),
> >>> an RDBMS such as PostgreSQL would be a major bottleneck. Another
> >>> option considered was flat files, and I think the best fit for that
> >>> would be a Hadoop cluster with HDFS. But in the case of HPC, how
> >>> can such an environment help with ingesting and analyzing large
> >>> amounts of data? Would said flat files be put on a SAN/NAS and
> >>> accessed through an NFS share for computational purposes?
> >>>
> >>> Regards,
> >>> Jonathan
> >>
> >> Good morning,
> >>
> >> Around here, we're using HBase for similar purposes. We have a bunch
> >> of smaller nodes storing the data, and all the management nodes
> >> (Ambari, HDFS namenodes, etc.) are VMs.
> >>
> >> Our nodes are configured so that we have a maximum of 2 cores per
> >> disk spindle and 4 GB of memory for each core. This seems to do the
> >> trick and is pretty responsive.
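> >>
> >> For a hypothetical 12-spindle node, that rule of thumb works out as:
> >>
> >>     # Rule of thumb: at most 2 cores per spindle, 4 GB RAM per core
> >>     spindles = 12            # hypothetical node
> >>     cores = 2 * spindles     # -> 24 cores
> >>     ram_gb = 4 * cores       # -> 96 GB
> >>     print(cores, ram_gb)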
> >>
> >> But to be able to provide better advice, you will probably need to
> >> go into a bit more detail about what types of data you will be
> >> storing and which kind of calculations you want to perform.
> >>
> >> /tony
> >>
> >>
> >> --
> >> Tony Albers - Systems Architect - IT Development
> >> Royal Danish Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark
> >> Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142
> >
> > --
> > Tony Albers - Systems Architect - IT Development
> > Royal Danish Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark
> > Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142