<div dir="ltr"><div><div><div>Dear Prof. Marks,<br><br></div>   As I suspected, users cannot use Ganglia. Our administrators are very jealous!!<br><br></div>Dear Elias Assmann,<br><br></div>   Many thanks for your comments. I will try to respond to some of them.<br><br><div class="gmail_extra"><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

First of all, I wonder: To what extent is this problem reproducible?<br>

E.g., does your job always run on the same 4 nodes? </blockquote><div><br></div><div>Yes.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Is it always the<br>

same node(s) that are slow?  </blockquote><div><br></div><div>Yes<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Does the problem also show up in other<br>

calculations (maybe just changing the number of k-points, or<br>

restarting the same case from scratch). </blockquote><div><br></div><div>The strangest part: at the beginning of this month, the same calculation was running properly. It then crashed because of convergence problems, and the slowdown started after I reduced the "mixing factor" in case.inm (it is now 0.04 in the pre-convergence SCF cycle). Obviously, I do not believe that the mixing factor is the cause.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Is it only lapw1 that is slow?<br></blockquote><div><br></div><div>No. All the executables run slowly on the problematic node.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Second, how did you make those ‘top’s?  As for ‘lapw0’ and ‘lapw1’, I<br>

am guessing that this is just because the snapshots were taken at<br>

different times (notice that the CPU times of lapw0 on the two nodes<br>

are quite different, too).<br></blockquote><div><br></div><div>Users cannot do anything themselves. The administrator sent me the "top" outputs, and I have asked him for simultaneous ones.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

About the CPU usage on ‘n2’, I find this very suspicious.  If it is as<br>

Peter said that the jobs are in the initialization and therefore not<br>

computing much, that may be fine; but I have to disagree with his<br>

assessment, because the memory usage of lapw1 on the two nodes is<br>

basically the same (if anything, the image sizes on ‘n2’ are slightly<br>

larger).  Note also that it is *not* the case that other processes are<br>

using the CPU; the total usage is at 7.5 %.<br>

<br>

It would be good to clarify that by getting a ‘top’ such that we know<br>

that lapw1 had been running for a while.  To this end, top has an ‘-n’<br>

option which says how many frames to output, e.g. ‘top -bn 10’.<br>

<br>

I am also curious about the load averages.  ‘n2’ has larger “mid-term”<br>

and “long-term” load averages than the others, and its “short-term”<br>

average is just as large.  I am not sure what that means.<br>

<span class=""><br>

On 09/23/2015 02:21 PM, Luis Ogando wrote:<br>

> I can not access the nodes. SSH among them is forbidden ! We have<br>

> to ask the administrators for anything !! It is the hell !! Of<br>

> course, only the PBS jobs can "travel" among the nodes.<br>

<br>

</span>I do not know about PBS Pro, but Torque and SGE have an option (I<br>

think ‘-I’ in either case) to submit an interactive job where you get<br>

a login on a node.  Of course that is only a realistic option when the<br>

queuing time is not too long.  Otherwise, any information that a more<br>

sophisticated tool can give you will also be available from the<br>

command line (just more painful to extract!) via ‘top’, ‘ps’, ‘/proc’,<br>

etc.  You can also put these things in a job script (which you<br>

apparently already did with ‘top’).<br>

<br>

<br>

Good luck,<br>

<br>

        Elias<br></blockquote><div><br></div><div>    Finally, I would like to thank everyone for the comments. If I did not respond to some of them, it is because the administrators said those points cannot be the origin of the problem: "everything is OK" (?).<br></div><div>   All the best,<br></div><div>                  Luis<br><br></div></div></div></div>
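<div>For reference, here is a minimal sketch of the job-script approach mentioned in the quoted message: collecting batch-mode "top" frames and the load averages from inside the job. The function name <code>node_snapshot</code>, its arguments and the log file name are hypothetical, chosen only for illustration. The sketch assumes a Linux node with the procps version of top.</div>

```shell
#!/bin/sh
# Hypothetical helper (not part of WIEN2k or PBS): append a timestamped
# snapshot of this node's load to a log file.
#   $1 = output file
#   $2 = number of top frames to record (default 3)
node_snapshot() {
    out=$1
    frames=${2:-3}
    {
        # Header so that logs from different nodes can be told apart
        echo "host=$(uname -n) time=$(date '+%Y-%m-%d %H:%M:%S')"
        # 1-, 5- and 15-minute load averages, if /proc is readable
        [ -r /proc/loadavg ] && cat /proc/loadavg
        # Batch-mode top, one frame per second; skipped if top is missing
        if command -v top >/dev/null 2>&1; then
            top -b -n "$frames" -d 1
        fi
    } >> "$out" 2>&1
}
```

<div>In a job script this could be started in the background once the calculation has been running for a while, for example <code>node_snapshot "top_$(uname -n).log" 10 &amp;</code>. Note that a job script normally runs only on the first node of the allocation. How to launch the same command on the other nodes (for example with <code>pbsdsh</code>, if the site provides it) depends on the local setup and is an assumption here.</div>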