<p dir="ltr">If it happens again, one thing to ask them to check is swap usage and how much memory is cached. On some of my nodes I have noticed that cached memory is not always released, and the node can start swapping. If this happens, the job will become very slow. The commands for clearing the cache can be found at <br>
<a href="http://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/">http://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/</a> or similar (root access is needed). The top command can also show memory use.</p>
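<p dir="ltr">As a rough illustration of the check described above, here is a minimal Python sketch that reads swap usage and cached memory from /proc/meminfo on a Linux node. The function names (parse_meminfo, memory_report, read_report) are hypothetical helpers made up for this example; only the /proc/meminfo field names come from the kernel.</p>

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; the unit suffix is ignored here.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_report(info):
    """Return swap in use, page cache and free memory in MiB (hypothetical helper)."""
    swap_used_kb = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "swap_used_mib": swap_used_kb / 1024,
        "cached_mib": info.get("Cached", 0) / 1024,
        "free_mib": info.get("MemFree", 0) / 1024,
    }


def read_report(path="/proc/meminfo"):
    """Read the kernel's memory summary (Linux only) and condense it."""
    with open(path) as handle:
        return memory_report(parse_meminfo(handle.read()))
```

<p dir="ltr">Calling read_report() on each node and printing the result gives a quick view of which nodes hold a large cache while swap is in use. It only reports; clearing the cache still needs the root commands from the link above.</p>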
<p dir="ltr">While there should be no need to do this, I have noticed that I need to do it every 3 hours on 4 nodes; the other 20 don't need it. It is an issue mainly for big calculations.</p>
<p dir="ltr">Alternatively, it may have been something else: a zombie process, big log files or other things. Rebooting gets rid of a lot of system caches and helps -- even on my Android tablet every week or two. It's murky waters.</p>
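<p dir="ltr">If a zombie is suspected, a node can be scanned for one before resorting to a reboot. The following is a minimal sketch, assuming a Linux /proc filesystem; process_state and find_zombies are hypothetical names used only for this example.</p>

```python
import os


def process_state(stat_text):
    """Return (command, state) from the text of /proc/PID/stat."""
    # The command name sits in parentheses and may itself contain
    # spaces or parentheses, so split on the last closing one.
    head, _, tail = stat_text.rpartition(")")
    command = head.partition("(")[2]
    return command, tail.split()[0]


def find_zombies(proc_root="/proc"):
    """List (pid, command) for every process in state Z (zombie)."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as handle:
                command, state = process_state(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "Z":
            zombies.append((int(entry), command))
    return sorted(zombies)
```

<p dir="ltr">An empty list from find_zombies() rules out zombies on that node, which leaves the other suspects (log files, caches) to check.</p>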
<p dir="ltr">---<br>
Professor Laurence Marks<br>
Department of Materials Science and Engineering<br>
Northwestern University<br>
<a href="http://www.numis.northwestern.edu">http://www.numis.northwestern.edu</a><br>
Corrosion in 4D: <a href="http://MURI4D.numis.northwestern.edu">http://MURI4D.numis.northwestern.edu</a><br>
Co-Editor, Acta Cryst A<br>
"Research is to see what everybody else has seen, and to think what nobody else has thought"<br>
Albert Szent-Györgyi</p>
<div class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div dir="ltr">
<div>
<div>
<div>Hi Elias,<br>
<br>
</div>
There were no other jobs in the specific queue I was using, and the nodes are dedicated to that queue, so it was an opportunity to reboot them without furious reactions from other users.
<br>
After trying everything suggested by the Wien2k community, the administrators resignedly remembered the words of wisdom given by the cluster guru, Shakespeare, and followed the suggestion given by Lyudmila Dobysheva. In other words, they killed my job, restarted all the nodes, and I resubmitted the calculation.<br>
</div>
All the best,<br>
</div>
Luis<br>
<br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">2015-09-29 3:50 GMT-03:00 Elias Assmann <span dir="ltr">
&lt;<a href="mailto:elias.assmann@gmail.com" target="_blank">elias.assmann@gmail.com</a>&gt;</span>:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span>
</span><span>On 09/28/2015 01:58 PM, Luis Ogando wrote:<br>
&gt; The problem is solved! The solution was one suggested by Lyudmila<br>
&gt; Dobysheva: reboot the nodes. We will never know the origin of the<br>
&gt; problem, but, honestly, I do not care!<br>
<br>
</span>Good to hear that! So, how did you get the admins to reboot them?<br>
<span><br>
&gt; "There are more things in heaven and earth, Horatio, Than are<br>
&gt; dreamt of in your philosophy."<br>
<br>
</span>That is an apt quote for people working on clusters ;-).<br>
<span><br>
<br>
Elias<br>
<br>
</span>
<div>
<div>_______________________________________________<br>
Wien mailing list<br>
<a href="mailto:Wien@zeus.theochem.tuwien.ac.at" target="_blank">Wien@zeus.theochem.tuwien.ac.at</a><br>
<a href="http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien" rel="noreferrer" target="_blank">http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien</a><br>
SEARCH the MAILING-LIST at: <a href="http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html" rel="noreferrer" target="_blank">
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html</a><br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>