<div dir="ltr"><div>Roland, the OpenHPC integration IS interesting.</div><div>I am on the OpenHPC list and look forward to the announcement there.<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 17 May 2018 at 15:00, Roland Fehrenbacher <span dir="ltr"><<a href="mailto:rf@q-leap.de" target="_blank">rf@q-leap.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">>>>>> "J" == Lux, Jim (337K) <<a href="mailto:james.p.lux@jpl.nasa.gov">james.p.lux@jpl.nasa.gov</a>> writes:<br>
<br>
J> The reason I hadn't looked at "diskless boot from a<br>
J> server" is the size of the image - assuming you don't have a<br>
J> high-bandwidth or reliable link.<br>
<br>
This is not something to worry about with Qlustar. A compressed<br>
Qlustar 10.0 image containing e.g. the core OS + Slurm + OFED + Lustre is<br>
a mere 165MB to transfer from the head node to a node (and consumes<br>
420MB of RAM once unpacked as the node's OS). Qlustar (and its<br>
non-public ancestors) has never used anything but RAM disks (with real<br>
disks for scratch); the first cluster running this way, at the end of 2001,<br>
was on Athlons ... and 100MB of RAM eaten up by the OS still mattered<br>
a lot back then :)<br>
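<br>
As a rough illustration of what that 165MB means on a slow link (the link speeds<br>
below are made-up examples, not measurements):<br>
<pre>
# Back-of-the-envelope: how long does the compressed boot image take to pull
# from the head node over a constrained link? The 165 MB / 420 MB figures are
# the ones quoted above; the link speeds are illustrative assumptions.

IMAGE_MB = 165          # compressed image transferred from the head node
UNPACKED_MB = 420       # RAM consumed once unpacked as the node OS

for label, mbit_per_s in [("100 Mbit/s Ethernet", 100),
                          ("10 Mbit/s constrained link", 10),
                          ("1 Mbit/s radio link", 1)]:
    seconds = IMAGE_MB * 8 / mbit_per_s
    print(f"{label:27s}: ~{seconds / 60:5.1f} min for {IMAGE_MB} MB "
          f"({UNPACKED_MB} MB of RAM once unpacked)")
</pre>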
<br>
So over the years, we perfected our image build mechanism to achieve a<br>
close-to-minimal (size-wise) OS, minimal in the sense of: given the required<br>
functionality (wanted kernel modules, services, binaries/scripts, libs),<br>
generate an image (module) of minimal size that provides it. That is maximally<br>
lightweight by definition.<br>
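<br>
The principle looks roughly like this toy sketch (the idea only, not our actual<br>
build tooling):<br>
<pre>
# Toy sketch of the "minimal image" principle: start from the binaries you want
# on the node and pull in only the shared libraries they actually need, instead
# of shipping whole packages. (Illustrative only, not the Qlustar tooling.)

import subprocess

def needed_libs(binary):
    """Paths of the shared libraries that ldd reports for one binary."""
    out = subprocess.run(["ldd", binary], capture_output=True, text=True).stdout
    return {tok for tok in out.split() if tok.startswith("/")}

def closure(binaries):
    """Files to put in the image: the binaries plus every library they need."""
    files = set(binaries)
    for b in binaries:
        files.update(needed_libs(b))
    return files

if __name__ == "__main__":
    wanted = ["/bin/sh", "/usr/bin/ssh"]   # stand-in for "required functionality"
    for path in sorted(closure(wanted)):
        print(path)
</pre>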
<br>
Yes, I know, you'll probably say "well, but it's just Ubuntu ...". Not for<br>
much longer though: CentOS support (incl. OpenHPC integration) is coming<br>
very soon ... and it's all open source and free.<br>
<br>
Best,<br>
<br>
Roland<br>
<br>
-------<br>
<a href="https://www.q-leap.com" rel="noreferrer" target="_blank">https://www.q-leap.com</a> / <a href="https://qlustar.com" rel="noreferrer" target="_blank">https://qlustar.com</a><br>
--- HPC / Storage / Cloud Linux Cluster OS ---<br>
<br>
J> On 5/12/18, 12:33 AM, "Beowulf on behalf of Chris Samuel"<br>
J> <<a href="mailto:beowulf-bounces@beowulf.org">beowulf-bounces@beowulf.org</a> on behalf of <a href="mailto:chris@csamuel.org">chris@csamuel.org</a>><br>
J> wrote:<br>
<br>
J> On Wednesday, 9 May 2018 2:34:11 AM AEST Lux, Jim (337K)<br>
<span class=""> J> wrote:<br>
<br>
>> While I’d never claim my pack of beagles is HPC, it does share<br>
>> some aspects – there’s parallel work going on, the nodes need to<br>
>> be aware of each other and synchronize their behavior (that is,<br>
>> it’s not an embarrassingly parallel task that’s farmed out from a<br>
>> queue), and most importantly, the management has to be scalable.<br>
>> While I might have 4 beagles on the bench right now – the idea is<br>
>> to scale the approach to hundreds. Typing “sudo apt-get install<br>
>> tbd-package” on 4 nodes sequentially might be ok (although pdsh<br>
>> and csshx help a lot), but it’s not viable for 100 nodes.<br>
<br>
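A rough sketch of that kind of fan-out without pdsh (hostnames and worker count<br>
are placeholders, error handling is minimal):<br>
<pre>
# Minimal sketch: run the same command on many nodes over ssh in parallel.
# Hostnames, worker count and the package name are placeholders.

import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:03d}" for i in range(1, 101)]     # hypothetical hostnames
CMD = "sudo apt-get -y install tbd-package"         # same placeholder package as above

def run(node):
    r = subprocess.run(["ssh", "-o", "BatchMode=yes", node, CMD],
                       capture_output=True, text=True)
    return node, r.returncode

with ThreadPoolExecutor(max_workers=32) as pool:
    for node, rc in pool.map(run, NODES):
        status = "ok" if rc == 0 else f"FAILED (rc={rc})"
        print(f"{node}: {status}")
</pre>
<br>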
</span> J> At ${JOB-1} we moved to diskless nodes and booting RAMdisk<br>
J> images from the management node back in 2013, and it worked<br>
J> really well for us. It removed the problem of nodes getting out<br>
J> of step because one of them was down when you pushed a package<br>
J> install across the cluster, took HDD failures out of the picture<br>
J> (though that's likely less of an issue with SSDs these days),<br>
J> and did I mention the peace of mind of knowing everything is<br>
J> the same? :-)<br>
<br>
J> It's not new: the Blue Gene systems we had (BG/P 2010-2012<br>
J> and BG/Q 2012-2016) booted RAMdisks, as they were designed from<br>
J> the beginning to scale up to huge systems and to remove as many<br>
J> points of failure as possible - no moving parts on the node<br>
J> cards, no local storage, no local state.<br>
<br>
J> Where I am now we're pretty much the same, except that instead<br>
J> of booting a pure RAM disk we boot an initrd that pivots onto an<br>
J> image stored on our Lustre filesystem. These nodes do have local<br>
J> SSDs for local scratch, but again no real local state.<br>
<br>
J> I think the place where this is going to get hard is on the<br>
J> application side of things. There were efforts like<br>
J> Fault-Tolerant MPI (which got subsumed into Open MPI), but it<br>
J> still relies on applications being written to use it and to<br>
J> cope with failures. Slurm includes fault-tolerance support too,<br>
J> in that you can request an allocation and, should a node fail,<br>
J> have "hot-spare" nodes replace the dead one - but again your<br>
J> application needs to be able to cope with it!<br>
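<br>
Roughly, "being able to cope" means a pattern like this toy (serial) sketch, where<br>
the program saves its own state and, after a restart on a spare node, resumes from<br>
the last checkpoint instead of from scratch:<br>
<pre>
# Toy sketch of application-level resilience (serial only; a real MPI code
# would have to coordinate checkpoints across ranks). The file name and the
# "work" loop are hypothetical stand-ins.

import os
import pickle

CKPT = "state.ckpt"          # checkpoint file, ideally on shared storage
TOTAL_STEPS = 1_000_000
CKPT_EVERY = 10_000

# Resume from the last checkpoint if one exists, otherwise start fresh.
if os.path.exists(CKPT):
    with open(CKPT, "rb") as f:
        start, result = pickle.load(f)
else:
    start, result = 0, 0.0

for step in range(start, TOTAL_STEPS):
    result += step * 1e-9                    # stand-in for real work
    if (step + 1) % CKPT_EVERY == 0:
        with open(CKPT, "wb") as f:          # real codes write atomically (tmp file + rename)
            pickle.dump((step + 1, result), f)

print("done, result:", result)
</pre>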
<br>
J> It's a fascinating subject, and the exascale folks have been<br>
J> talking about it for a while - LLNL's Dona Crawford gave a<br>
J> keynote about it at the Slurm User Group in 2013, and it is<br>
J> well worth a read.<br>
<br>
J> <a href="https://slurm.schedmd.com/SUG13/keynote.pdf" rel="noreferrer" target="_blank">https://slurm.schedmd.com/SUG13/keynote.pdf</a><br>
<br>
J> Slide 21 talks about the reliability/recovery side of things:<br>
<br>
J> # Mean time between failures of minutes or seconds for<br>
J> # exascale<br>
J> [...]<br>
J> # Need 100X improvement in MTTI so that applications can run<br>
J> # for many hours. Goal is 10X improvement in hardware<br>
J> # reliability. Local recovery and migration may yield another<br>
J> # 10X. However, for exascale, applications will need to be<br>
J> # fault resilient<br>
<br>
J> She also made the point that checkpoint/restart doesn't<br>
J> scale: at exascale, due to failures, you will likely end up<br>
J> spending all your compute time doing C/R and never actually<br>
J> getting any work done.<br>
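<br>
Young's classic approximation makes that concrete: the optimal checkpoint interval<br>
is about sqrt(2 x checkpoint_time x MTBF), and the fraction of machine time lost is<br>
then roughly sqrt(2 x checkpoint_time / MTBF). With purely illustrative numbers:<br>
<pre>
# Rough numbers behind "checkpoint/restart doesn't scale", using Young's
# approximation. T_CKPT_MIN is the (assumed) time to write one checkpoint; the
# lost-time estimate is only valid while it is much smaller than the MTBF,
# so it is capped at 100%.

from math import sqrt

T_CKPT_MIN = 5.0                                # minutes per checkpoint (assumed)

for mtbf_min in [1440.0, 60.0, 10.0, 5.0]:      # a day, an hour, minutes ...
    interval = sqrt(2 * T_CKPT_MIN * mtbf_min)  # optimal checkpoint interval (min)
    lost = min(1.0, sqrt(2 * T_CKPT_MIN / mtbf_min))
    print(f"MTBF {mtbf_min:6.1f} min: checkpoint every ~{interval:5.1f} min, "
          f"~{lost:.0%} of time lost")
</pre>
<br>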
<div class="HOEnZb"><div class="h5">______________________________<wbr>_________________<br>
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>
</div></div></blockquote></div><br></div>