[Wien] time difference among nodes

Luis Ogando lcodacal at gmail.com
Thu Sep 24 19:37:57 CEST 2015


Dear Prof. Marks,

   As I suspected, users cannot use Ganglia. Our administrators are very
jealous!

Dear Elias Assmann,

   Many thanks for your comments. I will try to respond to some of them.


> First of all, I wonder: To what extent is this problem reproducible?
> E.g., does your job always run on the same 4 nodes?


Yes.


> Is it always the
> same node(s) that are slow?


Yes.


> Does the problem also show up in other
> calculations (maybe just changing the number of k-points, or
> restarting the same case from scratch).


The strangest part: at the beginning of this month, the same calculation
was running properly. Then the job crashed because of convergence problems,
and when I reduced the mixing factor in case.inm (it is now 0.04 in the
pre-convergence SCF cycle), the problems started. Obviously, I do not
believe that the mixing factor is the cause.


> Is it only lapw1 that is slow?
>

No. All the executables run slowly on the problematic node.


>
> Second, how did you make those ‘top’s?  As for ‘lapw0’ and ‘lapw1’, I
> am guessing that this is just because the snapshots were taken at
> different times (notice that the CPU times of lapw0 on the two nodes
> are quite different, too).
>

Users cannot do anything themselves. The administrator sent me the "top"
outputs, and I have asked him for simultaneous ones.


>
> About the CPU usage on ‘n2’, I find this very suspicious.  If it is as
> Peter said that the jobs are in the initialization and therefore not
> computing much, that may be fine; but I have to disagree with his
> assessment, because the memory usage of lapw1 on the two nodes is
> basically the same (if anything, the image sizes on ‘n2’ are slightly
> larger).  Note also that it is *not* the case that other processes are
> using the CPU; the total usage is at 7.5 %.
>
> It would be good to clarify that by getting a ‘top’ such that we know
> that lapw1 had been running for a while.  To this end, top has an ‘-n’
> option which says how many frames to output, e.g. ‘top -bn 10’.
>
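
A minimal sketch of how such a multi-frame snapshot could be taken from
inside a job script, written in Python for illustration. The function names,
the 5-second delay, and the output file name are assumptions and not part of
the advice above; only the batch (-b) and frame-count (-n) options come from
the quoted suggestion, and -d is the usual procps option for the delay
between frames.

```python
import socket
import subprocess
import time


def top_command(frames=10, delay_s=5):
    """Build a batch-mode top command that prints `frames` snapshots."""
    # -b: batch (plain-text) mode, -n: number of frames,
    # -d: seconds to wait between frames
    return ["top", "-b", "-n", str(frames), "-d", str(delay_s)]


def snapshot_name(host, epoch_s):
    """Hypothetical file name recording which node and when."""
    return f"top_{host}_{int(epoch_s)}.txt"


def capture_top(frames=10, delay_s=5):
    """Run top on the current node and save its output to a file."""
    path = snapshot_name(socket.gethostname(), time.time())
    with open(path, "w") as out:
        subprocess.run(top_command(frames, delay_s), stdout=out, check=True)
    return path
```

Calling capture_top() on every node at about the same time, once lapw1 has
been running for a while, should give snapshots that can be compared
directly.
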
> I am also curious about the load averages.  ‘n2’ has larger “mid-term”
> and “long-term” load averages than the others, and its “short-term”
> average is just as large.  I am not sure what that means.
>
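
For reference, the three load averages printed by top are the 1-, 5-, and
15-minute averages, and on Linux the same numbers are available in
/proc/loadavg. A minimal Python sketch for reading them and normalizing by
the core count, so that nodes can be compared; the function names and the
dictionary keys are illustrative assumptions.

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into the three load averages.

    The file holds: 1-min, 5-min and 15-min averages, then
    running/total scheduling entities, then the most recent PID.
    """
    fields = text.split()
    return {
        "1min": float(fields[0]),
        "5min": float(fields[1]),
        "15min": float(fields[2]),
    }


def load_per_core(load, n_cores):
    """Divide each load average by the number of cores on the node."""
    return {key: value / n_cores for key, value in load.items()}


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the load averages of the current node (Linux only)."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```

A per-core value well above 1 would suggest more runnable processes than
cores. A node whose 15-minute value is clearly higher than that of the other
nodes was probably already busy before the snapshot was taken.
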
> On 09/23/2015 02:21 PM, Luis Ogando wrote:
> > I can not access the nodes. SSH among them is forbidden ! We have
> > to ask the administrators for anything !! It is the hell !! Of
> > course, only the PBS jobs can "travel" among the nodes.
>
> I do not know about PBS Pro, but Torque and SGE have an option (I
> think ‘-I’ in either case) to submit an interactive job where you get
> a login on a node.  Of course that is only a realistic option when the
> queuing time is not too long.  Otherwise, any information that a more
> sophisticated tool can give you will also be available from the
> command line (just more painful to extract!) via ‘top’, ‘ps’, ‘/proc’,
> etc.  You can also put these things in a job script (which you
> apparently already did with ‘top’).
>
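
As an illustration of the /proc route, the total CPU usage of a node can be
estimated from two readings of the first line of /proc/stat. This is only a
sketch in Python: the function names are assumptions, and the sample numbers
in the usage note are invented to reproduce a 7.5 % total usage.

```python
def cpu_times(stat_line):
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat.

    Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    Guest time is already counted in user time, so it is left out here.
    """
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:9]]


def busy_fraction(before, after):
    """Fraction of CPU time spent non-idle between two 'cpu' lines."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # idle is the 4th counter; iowait (5th) also counts as not busy
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return (total - idle) / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line of the current node (Linux only)."""
    with open(path) as handle:
        return handle.readline()
```

For example, busy_fraction("cpu  100 0 100 800 0 0 0 0",
"cpu  175 0 100 1725 0 0 0 0") gives 0.075, i.e. 7.5 % busy. In a job script
one would call read_cpu_line() twice, a few seconds apart, on each node.
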
>
> Good luck,
>
>         Elias
>

    Finally, I would like to thank everyone for the comments. If I did not
respond to some of them, it is because the administrators said they cannot
be the origin of the problem: "everything is OK" (?).
   All the best,
                  Luis