Dear Prof. Peter Blaha,<div><br></div><div>Thank you for your reply.</div><div><br class="Apple-interchange-newline"><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
In principle you can use 36, 18, 9, 6, 4, or 3 parallel jobs, but 16 is not meaningful.</blockquote></div><div><br></div><div>The computing node really does have 16 cores (two AMD Opteron(tm) Processor 6136 CPUs) and 32 GB of memory. With 16 parallel jobs, the 36 k-points are divided so that 4 cores get 3 k-points each and the other 12 cores get 2 k-points each. As you suggested, if I use only 12 cores, lapw1 might take less time. </div>
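<div><br></div><div>To make the division concrete, here is a minimal sketch of an even split of k-points over parallel jobs. It assumes every k-point costs roughly the same and that all jobs run at the same time; the function name split_kpoints is hypothetical and is not part of WIEN2k.</div>

```python
def split_kpoints(n_kpoints, n_jobs):
    """Spread n_kpoints over n_jobs as evenly as possible (illustrative only)."""
    base, extra = divmod(n_kpoints, n_jobs)
    # 'extra' jobs receive one additional k-point; the rest receive 'base'.
    return [base + 1] * extra + [base] * (n_jobs - extra)


# Compare the 16-job split described above with a 12-job split.
for n_jobs in (16, 12):
    counts = split_kpoints(36, n_jobs)
    print(n_jobs, "jobs:", counts, "-> slowest job handles", max(counts), "k-points")
```

<div>Under these assumptions the slowest job handles 3 k-points in both cases, so 12 jobs should need about the same number of rounds as 16 while keeping 4 cores free and putting less pressure on memory and I/O.</div>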
<div><br></div><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
ii) try to use a (local) $SCRATCH directory, which reduces the NFS load. But this works only<br> if your k-list and .machines file are "compatible" as mentioned above.</blockquote><div><br></div><div>Actually, the administrator recently moved my /home directory to a local disk on the login node. Before this change, when /home was on a network disk array, such heavy I/O never occurred. I suspect this change may be the reason for the crash. </div>
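<div><br></div><div>One possible way to keep the k-list and the .machines file compatible is to pick the largest job count that divides the number of k-points and does not exceed the number of cores, then write one line per job. The sketch below is an assumption-laden illustration: the host name node01 and both function names are hypothetical, and the granularity and extrafine lines follow the usual k-point-parallel layout, which may need adjusting for a given installation.</div>

```python
def largest_even_split(n_kpoints, max_jobs):
    """Largest job count up to max_jobs that divides n_kpoints exactly."""
    return max(d for d in range(1, max_jobs + 1) if n_kpoints % d == 0)


def machines_file(host, n_jobs):
    """Build .machines text with one k-point-parallel job line per core used."""
    lines = ["1:" + host for _ in range(n_jobs)]
    # Assumed trailer lines; check them against the local WIEN2k setup.
    lines += ["granularity:1", "extrafine:1"]
    return "\n".join(lines) + "\n"


n_jobs = largest_even_split(36, 16)      # 36 k-points, 16 cores available
print(machines_file("node01", n_jobs))   # "node01" is a hypothetical host name
```

<div>For 36 k-points on a 16-core node this selects 12 jobs, so each job handles exactly 3 k-points.</div>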
<div><br></div><div>Any comments will be appreciated.</div><div><br></div><div>Best,</div><div><div> </div><br><div class="gmail_quote">On Fri, Feb 3, 2012 at 9:53 PM, Peter Blaha <span dir="ltr"><<a href="mailto:pblaha@theochem.tuwien.ac.at" target="_blank">pblaha@theochem.tuwien.ac.at</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
<br>
Clearly you should write your job script such that it divides the 36 k-points in a<br>
"meaningful" way.<br>
In principle you can use 36, 18, 9, 6, 4, or 3 parallel jobs, but 16 is not meaningful.<br>
<br>
Furthermore, it seems that your cluster has problems with heavy I/O (NFS) and this is<br>
most likely the reason for the observed high load and the crash. Thus I would<br>
i) not use too many cores. Does one node of your cluster really have 16 cores, or is this just due<br>
to "multithreading" and in fact it has only 8? Do you have enough memory per node?<br>
ii) try to use a (local) $SCRATCH directory, which reduces the NFS load. But this works only<br>
if your k-list and .machines file are "compatible" as mentioned above.<br>
<br>
It also seems to be a rather large calculation (lapw1 took nearly 2h), thus you may either need MPI<br>
or you should not use all cores on one node of your cluster because of memory restrictions.<br>
<br>
<br>
On 03.02.2012 13:56, Bin Shao wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div>
Dear all,<br>
<br>
I am running WIEN2k 11.1 on a cluster with CentOS 6 under a PBS queuing system. The job is submitted in k-point parallel mode and the total of 36 k-points is divided among 16 CPUs.<br>
However, errors occur in lapw2 and the dnlapw2_18/19/20.error files are not empty. At the same time, the job in the PBS system seems dead and cannot be killed by the PBS<br>
command. The administrator checked the computing node, and the top command shows that the node is experiencing a very heavy load, above 40. Further, ps aux shows that there are 16 lapw2<br>
processes, but they are not running (that is, they are suspended). The jobs caused a heavy load and triggered the self-protection mechanism of the OS, which automatically suspends any running process,<br>
including ssh logins, except for the root account.<br>
<br>
Any comments will be appreciated, and thanks in advance.<br>
<br>
The following are the error files and case.dayfile.<br>
--------------------dnlapw2_18/19/20.error------------------<br>
Error in LAPW2<br>
------------------------------------------------------------------------<br>
<br>
---------------------case.output2dn_19------------------------<br>
...<br>
KVEC( 73563) = -19 -5 9 9.1046 1<br>
KVEC( 73564) = -19 24 -9 9.1046 1<br>
KVEC( 73565) = -19 24 9 9.1046 1<br>
KVEC( 73566) = 19 -24 -9 9.1046 1<br>
KVEC( 73567) = 19 -24 9 9.1046 1<br>
KVEC( 73568) = 19 5 -9 9.1046 1<br>
KVEC( 73569) = 19 5 9 9.1046 1<br>
KVE<br>
------------------------------------------------------------------------<br>
<br>
--------------------case.dayfile-----------------------------------<br>
...<br>
[14] Done ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout2_$loop;<br>
if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )<br>
[9] Done ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout2_$loop;<br>
if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )<br>
[4] Done ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout2_$loop;<br>
if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )<br>
[4] 18809<br>
-----------------------------------------------------------------------------<br>
<br>
-----------------------------:log--------------------------------------------<br>
...<br>
Thu Feb 2 17:58:03 CST 2012> (x) lapw1 -c -dn -p -orb<br>
Thu Feb 2 19:46:53 CST 2012> (x) lapw2 -c -up -p<br>
Thu Feb 2 19:51:36 CST 2012> (x) sumpara -up -d<br>
Thu Feb 2 19:52:07 CST 2012> (x) lapw2 -c -dn -p<br>
--------------------------------------------------------------------------------<br>
<br>
(If more information is needed, I will provide it.)<br>
<br>
Best,<br>
<br>
--<br>
Bin Shao, Ph.D. Candidate<br>
College of Information Technical Science, Nankai University<br>
94 Weijin Rd. Nankai Dist. Tianjin 300071, China<br></div></div>
Email: <a href="mailto:bshao@mail.nankai.edu.cn" target="_blank">bshao@mail.nankai.edu.cn</a><br>
<br>
<br>
<br>
_______________________________________________<br>
Wien mailing list<br>
<a href="mailto:Wien@zeus.theochem.tuwien.ac.at" target="_blank">Wien@zeus.theochem.tuwien.ac.at</a><br>
<a href="http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien" target="_blank">http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien</a><span><font color="#888888"><br>
</font></span></blockquote><span><font color="#888888">
<br>
-- <br>
<br>
P.Blaha<br>
--------------------------------------------------------------------------<br>
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna<br>
Phone: <a href="tel:%2B43-1-58801-165300" value="+43158801165300" target="_blank">+43-1-58801-165300</a> FAX: <a href="tel:%2B43-1-58801-165982" value="+43158801165982" target="_blank">+43-1-58801-165982</a><br>
Email: <a href="mailto:blaha@theochem.tuwien.ac.at" target="_blank">blaha@theochem.tuwien.ac.at</a> WWW: <a href="http://info.tuwien.ac.at/theochem/" target="_blank">http://info.tuwien.ac.at/theochem/</a><br>
--------------------------------------------------------------------------<br>
</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br>Bin Shao, Ph.D. Candidate<br>College of Information Technical Science, Nankai University<br>94 Weijin Rd. Nankai Dist. Tianjin 300071, China<br>Email: <a href="mailto:bshao@mail.nankai.edu.cn" target="_blank">bshao@mail.nankai.edu.cn</a><br>
<br>
</div>