<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p><br>
</p>
<div class="moz-cite-prefix">On 04/19/2017 03:21 PM, Jörg
Saßmannshausen wrote:<br>
</div>
<blockquote
cite="mid:201704192021.05212.j.sassmannshausen@ucl.ac.uk"
type="cite">
<pre wrap="">Hi Prentice,
three questions (not necessarily for you, and they can be dealt with in a different
thread too):
- why automount and not a static mount?</pre>
</blockquote>
Well, I've been told that, in general, automounting reduces the
load on the servers, since the mounts only exist while they are
needed. I'm somewhat skeptical of that specific claim myself.
In this case, though, these /l/hostname directories aren't used by
every job, so at any given time most of the cluster nodes aren't
actually serving these dirs out over NFS, so there is some truth to
it. <br>
<br>
Automounting certainly makes life easier when your home directories
or project directories are spread over many different servers: no
need for a massive /etc/fstab, and it's easy to move directories from
server to server when needed without updating the /etc/fstab on
every single node, so there's that. <br>
<br>
Just about everywhere I've worked, /home and project/shared
directories are automounted, and directories like /usr/local are
statically mounted. <br>
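<br>
For anyone curious, a minimal sketch of the sort of autofs setup I
have in mind (the server names and map entries below are made up for
illustration, not taken from any real config): <br>
<pre wrap="">
# /etc/auto.master -- hand /home over to the auto.home map
/home   /etc/auto.home  --timeout=300

# /etc/auto.home -- one key per directory; the NFS servers are hypothetical
alice   -rw,hard,intr   nfs-server1:/export/home/alice
bob     -rw,hard,intr   nfs-server2:/export/home/bob
</pre>
Moving bob's home directory to another server then only means editing
that one map entry (plus exporting it from the new server); no
/etc/fstab changes anywhere. <br>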
<blockquote
cite="mid:201704192021.05212.j.sassmannshausen@ucl.ac.uk"
type="cite">
<pre wrap="">
- do I get that right that the nodes themselves export shares to other nodes?</pre>
</blockquote>
Exactly! It's not how I would do it. In fact, I think it's a
horrible idea, but I inherited it from those who came before me and
have to live with it now. <br>
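<br>
To make that concrete, the arrangement amounts to something like the
following on every node (sketched from memory with a made-up subnet,
not copied from our actual files): <br>
<pre wrap="">
# /etc/exports on each compute node -- every node is also an NFS server
/local  192.168.1.0/24(rw,sync,no_subtree_check)

# /etc/auto.master on each node
/l      /etc/auto.l

# /etc/auto.l -- wildcard map: touching /l/hostA mounts hostA:/local on demand
*       -rw,hard,intr   &:/local
</pre>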
<blockquote
cite="mid:201704192021.05212.j.sassmannshausen@ucl.ac.uk"
type="cite">
<pre wrap="">
- has anything changed? I am thinking of something like more nodes added, new
programs being installed, more users added, generally a higher load on the
cluster.</pre>
</blockquote>
Not on my end. After dealing with this problem for weeks, it
finally came out that a user changed his code a few days before
these problems started, but at the moment there's no evidence that
that change broke anything. I rebooted all the nodes yesterday as a
'hail mary', and jobs have been running just fine ever since, so
that's an important clue to this mystery (some sort of resource
exhaustion?).<br>
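<br>
If it does turn out to be resource exhaustion, a few generic things I
plan to watch on the servers (just my guesses, nothing NFS-specific):
<br>
<pre wrap="">
# Allocated file handles vs. the system-wide maximum
cat /proc/sys/fs/file-nr

# Socket usage summary -- are TCP connections piling up?
ss -s

# Server-side NFS/RPC statistics; watch for error counters growing over time
nfsstat -s
</pre>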
<blockquote
cite="mid:201704192021.05212.j.sassmannshausen@ucl.ac.uk"
type="cite">
<pre wrap="">
One problem I had in the past with my 112-node cluster, where I am exporting
/home, /opt and one directory in /usr/local from the headnode to all the nodes,
was that the NFS server on the headnode did not have enough spare server processes
assigned and thus was running out of capacity. That also led to strange
behaviour, which I fixed by increasing the number of spare servers. </pre>
</blockquote>
In this case, there's probably only a single client accessing any one of
these NFS shares at a time, maybe 2-3 at most, so I don't think it's
likely that the server is being overwhelmed by clients here.
<br>
<blockquote
cite="mid:201704192021.05212.j.sassmannshausen@ucl.ac.uk"
type="cite">
<pre wrap="">
The way I did that was by setting this in
/etc/default/nfs-kernel-server:
# Number of servers to start up
RPCNFSDCOUNT=32
That seems to provide the right number of servers, with enough spare ones, for me.
Like in your case, the cluster was running stably until I added more nodes
*and* users decided to use them, i.e. the load on the cluster went up. A more
idle cluster did not show any problems; a cluster under 80 % load suddenly had
problems.
I hope that helps a bit. I am not an expert in NFS either and this is just
my experience. I am also using Debian nfs-kernel-server 1:1.2.6-4, if that
helps. </pre>
</blockquote>
<br>
I think it's unlikely that this will fix my issue, but I'm not
ruling anything out at this time. Thanks for the suggestion. <br>
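<br>
If I do decide to rule it out properly, my understanding is that the
Linux knfsd exposes enough to tell whether the threads are ever all
busy (these are the generic commands, nothing specific to my setup):
<br>
<pre wrap="">
# How many nfsd threads are currently running?
ps ax | grep -c '[n]fsd'

# The "th" line: first field is the thread count, the second is how many
# times all threads were busy at once -- non-zero suggests adding threads
grep ^th /proc/net/rpc/nfsd

# Raise the thread count on the fly (32 is just the value from Jörg's config)
rpc.nfsd 32
</pre>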
<blockquote
cite="mid:201704192021.05212.j.sassmannshausen@ucl.ac.uk"
type="cite">
<pre wrap="">
All the best from a sunny London
Jörg
On Wednesday 19 April 2017 Prentice Bisbal wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On 04/19/2017 02:17 PM, Ellis H. Wilson III wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On 04/19/2017 02:11 PM, Prentice Bisbal wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Thanks for the suggestion(s). Just this morning I started considering
the network as a possible source of error. My stale file handle errors
are easily fixed by just restarting the NFS servers with 'service nfs
restart', so they aren't as severe as the ones you describe.
</pre>
</blockquote>
<pre wrap="">
If a restart solely on the /server-side/ gets you back into a good
state, this is an interesting tidbit.
</pre>
</blockquote>
<pre wrap="">
That is correct: restarting NFS on the server side is all it takes to
fix the problem.
</pre>
<blockquote type="cite">
<pre wrap="">Do you have some form of HA setup for NFS? Automatic failover
(sometimes set up with IP aliasing) in the face of network hiccups can
occasionally goof the clients if they aren't set up properly to keep up
with the change. A restart of the server will likely revert back to
using the primary, resulting in the clients thinking everything is
back up and healthy again. This situation varies so much between
vendors that it's hard to say much more without more details on your setup.
</pre>
</blockquote>
<pre wrap="">
My setup isn't nearly that complicated. Every node in this cluster has a
/local directory that is shared out to the other nodes in the cluster.
The other nodes automount this remote directory as /l/hostname, where
"hostname" is the name of the node that owns the filesystem. For example, hostB
will mount hostA:/local as /l/hostA.
No fancy fail-over or anything like that.
</pre>
<blockquote type="cite">
<pre wrap="">Best,
ellis
P.S., apologies for the top-post last time around.
</pre>
</blockquote>
<pre wrap="">
No worries. I'm so used to people doing that in mailing lists that I've
become numb to it.
Prentice
</pre>
</blockquote>
<br>
<br>
<pre wrap="">_______________________________________________
Beowulf mailing list, <a class="moz-txt-link-abbreviated" href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit <a class="moz-txt-link-freetext" href="http://www.beowulf.org/mailman/listinfo/beowulf">http://www.beowulf.org/mailman/listinfo/beowulf</a>
</pre>
</blockquote>
<br>
</body>
</html>