<div dir="ltr">Trying to decrease the size of a previous message !!!<br><div><div class="gmail_quote"><br>--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br><div dir="ltr">Dear Prof. Blaha and Marks,<br><br>   Please, find below the "top" output for my calculation.<br>   As you can see, there is a huge difference in CPU use for the r1i1n2 node (the problematic one). What could be the reason ? What can I do ?<br>  We also have the first two nodes executing lapw0_mpi while the other two are executing lapw1c_mpi. Is this normal ?<br>   Thank you again,<br>        Luis<tt><br></tt><div><div><tt>
================================================================================================

r1i1n0

top - 17:41:29 up 11 days,  8:49,  2 users,  load average: 10.95, 4.99, 2.01
Tasks: 248 total,  13 running, 235 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.9%us,  0.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:     36176M total,     8820M used,    27355M free,        0M buffers
Swap:        0M total,        0M used,        0M free,     7248M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6670 ogando    20   0  517m  70m  14m R  100  0.2   2:22.27 lapw0_mpi
 6671 ogando    20   0  511m  71m  19m R  100  0.2   2:22.57 lapw0_mpi
 6672 ogando    20   0  512m  67m  15m R  100  0.2   2:22.26 lapw0_mpi
 6673 ogando    20   0  511m  69m  18m R  100  0.2   2:22.49 lapw0_mpi
 6674 ogando    20   0  511m  64m  13m R  100  0.2   2:22.69 lapw0_mpi
 6675 ogando    20   0  511m  67m  16m R  100  0.2   2:22.63 lapw0_mpi
 6676 ogando    20   0  511m  63m  12m R  100  0.2   2:22.24 lapw0_mpi
 6677 ogando    20   0  511m  62m  11m R  100  0.2   2:22.59 lapw0_mpi
 6679 ogando    20   0  511m  67m  16m R  100  0.2   2:22.20 lapw0_mpi
 6681 ogando    20   0  512m  62m  11m R  100  0.2   2:22.70 lapw0_mpi
 6678 ogando    20   0  511m  64m  13m R  100  0.2   2:22.64 lapw0_mpi
 6680 ogando    20   0  510m  62m  12m R  100  0.2   2:22.55 lapw0_mpi
  924 ogando    20   0 12916 1620  996 S    0  0.0   0:00.28 run_lapw
 6506 ogando    20   0 13024 1820  992 S    0  0.0   0:00.02 x
 6527 ogando    20   0 12740 1456  996 S    0  0.0   0:00.02 lapw0para
 6669 ogando    20   0 74180 3632 2236 S    0  0.0   0:00.09 mpirun
17182 ogando    20   0 13308 1892 1060 S    0  0.0   0:00.13 csh
17183 ogando    20   0 10364  656  396 S    0  0.0   0:00.40 pbs_demux
17203 ogando    20   0 12932 1720 1008 S    0  0.0   0:00.07 csh


r1i1n1

top - 17:40:46 up 12 days, 9 min,  2 users,  load average: 10.55, 4.34, 1.74
Tasks: 242 total,  13 running, 229 sleeping,   0 stopped,   0 zombie
Cpu(s):100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:     36176M total,    36080M used,       96M free,        0M buffers
Swap:        0M total,        0M used,        0M free,    34456M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27446 ogando    20   0  516m  65m 9368 R  100  0.2   1:34.78 lapw0_mpi
27447 ogando    20   0  517m  66m 9432 R  100  0.2   1:35.16 lapw0_mpi
27448 ogando    20   0  516m  65m 9412 R  100  0.2   1:34.88 lapw0_mpi
27449 ogando    20   0  516m  65m 9464 R  100  0.2   1:33.37 lapw0_mpi
27450 ogando    20   0  515m  65m 9440 R  100  0.2   1:33.96 lapw0_mpi
27453 ogando    20   0  516m  65m 9480 R  100  0.2   1:35.44 lapw0_mpi
27454 ogando    20   0  515m  65m 9424 R  100  0.2   1:35.85 lapw0_mpi
27455 ogando    20   0  516m  65m 9452 R  100  0.2   1:34.47 lapw0_mpi
27456 ogando    20   0  516m  65m 9440 R  100  0.2   1:34.78 lapw0_mpi
27457 ogando    20   0  516m  65m 9420 R  100  0.2   1:30.90 lapw0_mpi
27451 ogando    20   0  517m  65m 9472 R  100  0.2   1:34.65 lapw0_mpi
27452 ogando    20   0  516m  65m 9436 R  100  0.2   1:33.63 lapw0_mpi
27445 ogando    20   0 67540 3336 2052 S    0  0.0   0:00.11 orted

r1i1n2

top - 17:42:30 up 221 days,  6:29,  1 user,  load average: 10.76, 9.59, 8.79
Tasks: 242 total,  13 running, 229 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.5%us,  0.1%sy,  0.0%ni, 92.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:     36176M total,    31464M used,     4712M free,        0M buffers
Swap:        0M total,        0M used,        0M free,    10563M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2096 ogando    20   0  927m 642m  20m R    9  1.8   0:09.30 lapw1c_mpi
 2109 ogando    20   0  926m 633m  17m R    9  1.8   0:14.58 lapw1c_mpi
 2122 ogando    20   0  924m 633m  19m R    9  1.8   0:09.65 lapw1c_mpi
 2124 ogando    20   0  922m 627m  15m R    9  1.7   0:06.72 lapw1c_mpi
 2108 ogando    20   0  927m 633m  17m R    8  1.8   0:09.04 lapw1c_mpi
 2110 ogando    20   0  926m 633m  17m R    8  1.7   0:09.01 lapw1c_mpi
 2111 ogando    20   0  924m 627m  13m R    8  1.7   0:14.56 lapw1c_mpi
 2095 ogando    20   0  930m 641m  17m R    8  1.8   0:09.32 lapw1c_mpi
 2121 ogando    20   0  927m 634m  17m R    8  1.8   0:06.76 lapw1c_mpi
 2123 ogando    20   0  924m 632m  18m R    8  1.7   0:09.65 lapw1c_mpi
 2098 ogando    20   0  922m 634m  16m R    8  1.8   0:06.71 lapw1c_mpi
 2097 ogando    20   0  927m 641m  19m R    7  1.8   0:06.75 lapw1c_mpi
 2094 ogando    20   0 67048 2928 2052 S    0  0.0   0:00.02 orted
 2099 ogando    20   0 67048 2932 2052 S    0  0.0   0:00.01 orted
 2120 ogando    20   0 67048 2924 2052 S    0  0.0   0:00.01 orted

r1i1n3

top - 17:42:50 up 56 days,  3:25,  1 user,  load average: 10.57, 6.02, 2.59
Tasks: 241 total,  13 running, 228 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.5%us,  0.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:     36176M total,    31395M used,     4781M free,        0M buffers
Swap:        0M total,        0M used,        0M free,    23089M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26197 ogando    20   0  913m 634m  17m R  100  1.8   0:59.33 lapw1c_mpi
26199 ogando    20   0  909m 630m  16m R  100  1.7   0:59.39 lapw1c_mpi
26200 ogando    20   0  906m 624m  13m R  100  1.7   0:59.35 lapw1c_mpi
26260 ogando    20   0  910m 631m  16m R  100  1.7   0:57.17 lapw1c_mpi
26271 ogando    20   0  904m 625m  17m R  100  1.7   0:54.98 lapw1c_mpi
26273 ogando    20   0  903m 624m  16m R  100  1.7   0:55.03 lapw1c_mpi
26274 ogando    20   0  903m 620m  13m R  100  1.7   0:55.04 lapw1c_mpi
26258 ogando    20   0  913m 634m  17m R  100  1.8   0:57.17 lapw1c_mpi
26259 ogando    20   0  910m 631m  16m R  100  1.7   0:57.16 lapw1c_mpi
26261 ogando    20   0  908m 625m  13m R  100  1.7   0:57.22 lapw1c_mpi
26272 ogando    20   0  903m 624m  17m R  100  1.7   0:54.97 lapw1c_mpi
26198 ogando    20   0  909m 631m  17m R   99  1.7   0:59.34 lapw1c_mpi
26196 ogando    20   0 67048 2924 2052 S    0  0.0   0:00.02 orted
26257 ogando    20   0 67048 2928 2052 S    0  0.0   0:00.01 orted
26270 ogando    20   0 67048 2924 2052 S    0  0.0   0:00.01 orted

2015-09-21 10:22 GMT-03:00 Peter Blaha <pblaha@theochem.tuwien.ac.at>:

a) Check your .machines file. Does it meet your expectations, or does this node carry too large a load?
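For example, for 4 nodes with 12 cores each, a .machines file for MPI-parallel lapw1/lapw2 would look roughly like the sketch below (the hostnames and core counts are only an illustration, take your own):

    granularity:1
    1:r1i1n0:12
    1:r1i1n1:12
    1:r1i1n2:12
    1:r1i1n3:12
    lapw0:r1i1n0:12 r1i1n1:12 r1i1n2:12 r1i1n3:12
    extrafine:1

Each "1:" line defines one parallel group for lapw1/lapw2, and the "lapw0:" line fixes where lapw0_mpi runs; a missing or unbalanced entry for one node shows up exactly as the uneven load you describe.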

b) Can you log in to these nodes interactively while your job is running ?
If yes, log in on 2 nodes (in two windows) and run    top
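For instance (node name and user name are just the ones from your output):

    ssh r1i1n2
    top -u ogando     # pressing 1 inside top shows the per-core load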

c) If nothing obvious is wrong so far, test the network by doing some larger copies from/to these nodes from your $home (or $scratch) to see whether file I/O is killing you.
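A simple test of this kind, run on each node, could look like the following (file name, size and the $SCRATCH/$HOME variables are only placeholders for your local setup):

    cd $SCRATCH
    dd if=/dev/zero of=testfile bs=1M count=2000     # create a ~2 GB local test file
    time cp testfile $HOME/testfile                  # time writing to the shared file system
    time cp $HOME/testfile testfile.back             # time reading it back
    rm testfile testfile.back $HOME/testfile

If one node needs much longer here than the others, the network or file I/O is the problem.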

On 09/21/2015 02:51 PM, Luis Ogando wrote:
Dear Prof. Marks,

    Many thanks for your help.
    The administrators said that everything is OK and that the software is the problem (the easy answer): no zombies, no other jobs on the node, ... !!
    Let me give you more information to see if you can imagine other possibilities:

1) Intel Xeon Six Core 5680, 3.33 GHz

2) Intel(R) Fortran/CC/OpenMPI Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.1.256 Build 20111011

3) OpenMPI 1.6.5

4) PBS Pro 11.0.2

5) OpenMPI built using --with-tm, because ssh among the nodes is prohibited ( http://www.open-mpi.org/faq/?category=building#build-rte-tm )

6) Wien2k 14.2

7) The mystery: two weeks ago, everything was working properly !!

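(Regarding item 5: the Open MPI build was configured along these lines; the installation prefix and PBS path below are only illustrative, not necessarily the exact ones used here:

    ./configure CC=icc CXX=icpc FC=ifort F77=ifort \
        --prefix=$HOME/local/openmpi-1.6.5 \
        --with-tm=/opt/pbs        # launch via the PBS/Torque TM interface instead of ssh
    make all install
)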
    Many thanks again !
    All the best,
                    Luis

2015-09-18 23:24 GMT-03:00 Laurence Marks <laurence.marks@gmail.com>:

    Almost certainly one or more of:
    * Other jobs on the node
    * Zombie process(es)
    * Too many MPI processes
    * Bad memory
    * Full disk
    * Too hot

    If you have it, use ganglia; if not, ssh in and use top/ps or whatever SGI provides. If you cannot sudo, get help from someone who can.
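    Once on the node, a few standard commands cover most of that list; something like the sketch below (exact paths depend on your node image, adjust as needed):

        ps aux | grep -v grep | grep -E "lapw|defunct"   # stray or zombie WIEN2k processes
        uptime                                           # load vs. number of cores
        free -m                                          # memory actually available
        df -h                                            # any full disk?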
    On Sep 18, 2015 8:58 PM, "Luis Ogando" <lcodacal@gmail.com> wrote:

        Dear Wien2k community,

            I am using Wien2k on an SGI cluster with 32 nodes. My calculation is running on 4 nodes that have the same characteristics, and only my job is running on these 4 nodes.
            I noticed that one of these 4 nodes is spending more than 20 times the time spent by the other 3 nodes in the run_lapw execution.
            Could someone imagine a reason for this ? Any advice ?
            All the best,
                     Luis
-- 

                                      P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha@theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html