[Wien] time difference among nodes
Peter Blaha
pblaha at theochem.tuwien.ac.at
Wed Sep 23 08:32:54 CEST 2015
Of course, at the "same time" ONLY lapw0_mpi OR lapw1_mpi should be
running.
However, I assume you took these "top" snapshots sequentially, one after
the other? And of course, in an SCF cycle, lapw1 will start a few minutes
after lapw0 has been running.
Do these tests in several windows in parallel.
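For example (a minimal sketch, assuming you can still ssh from the login
node into the compute nodes; the node names are taken from your output,
adapt as needed):

   # grab one "top" snapshot from every node at (roughly) the same time
   for node in r1i1n0 r1i1n1 r1i1n2 r1i1n3 ; do
      ssh $node "top -b -n 1 | head -30" > top_$node.log &
   done
   wait

Otherwise, simply open one terminal per node and watch top interactively.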
The only suspicious info is the memory consumption. On the slow node you
see:
> Mem: 36176M total, 8820M used, 27355M free,
on the fast one:
> Mem: 36176M total, 36080M used, 96M free,
It may indicate that the slow node has a different configuration; in
particular, it does not seem to buffer I/O but keeps only the running
programs (12 x 500 MB) in memory. The fast one uses "all" of its memory,
and typically this is used by the operating system to hold various
daemons and buffers permanently in memory.
The latter behavior is what I normally see on my nodes and what should
be the default.
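To compare the configuration of the two nodes, something like the
following may help (a minimal sketch; run it on the slow and on a fast
node and compare the numbers):

   free -m                                    # total/used/free/buffers/cached
   grep -E 'MemTotal|MemFree|Buffers|^Cached|Dirty' /proc/meminfo
   sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio

If the buffer/cache related settings differ, the administrators should
adjust the slow node to match the others.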
On 09/22/2015 10:08 PM, Luis Ogando wrote:
> Trying to decrease the size of a previous message!!!
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Dear Prof. Blaha and Marks,
>
> Please find below the "top" output for my calculation.
> As you can see, there is a huge difference in CPU use on the r1i1n2
> node (the problematic one). What could be the reason? What can I do?
> We also have the first two nodes executing lapw0_mpi while the other
> two are executing lapw1c_mpi. Is this normal?
> Thank you again,
> Luis
> ================================================================================================
>
> r1i1n0
>
> top - 17:41:29 up 11 days, 8:49, 2 users, load average: 10.95, 4.99, 2.01
> Tasks: 248 total, 13 running, 235 sleeping, 0 stopped, 0 zombie
> Cpu(s): 99.9%us, 0.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 36176M total, 8820M used, 27355M free, 0M buffers
> Swap: 0M total, 0M used, 0M free, 7248M cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 6670 ogando 20 0 517m 70m 14m R 100 0.2 2:22.27 lapw0_mpi
> 6671 ogando 20 0 511m 71m 19m R 100 0.2 2:22.57 lapw0_mpi
> 6672 ogando 20 0 512m 67m 15m R 100 0.2 2:22.26 lapw0_mpi
> 6673 ogando 20 0 511m 69m 18m R 100 0.2 2:22.49 lapw0_mpi
> 6674 ogando 20 0 511m 64m 13m R 100 0.2 2:22.69 lapw0_mpi
> 6675 ogando 20 0 511m 67m 16m R 100 0.2 2:22.63 lapw0_mpi
> 6676 ogando 20 0 511m 63m 12m R 100 0.2 2:22.24 lapw0_mpi
> 6677 ogando 20 0 511m 62m 11m R 100 0.2 2:22.59 lapw0_mpi
> 6679 ogando 20 0 511m 67m 16m R 100 0.2 2:22.20 lapw0_mpi
> 6681 ogando 20 0 512m 62m 11m R 100 0.2 2:22.70 lapw0_mpi
> 6678 ogando 20 0 511m 64m 13m R 100 0.2 2:22.64 lapw0_mpi
> 6680 ogando 20 0 510m 62m 12m R 100 0.2 2:22.55 lapw0_mpi
> 924 ogando 20 0 12916 1620 996 S 0 0.0 0:00.28 run_lapw
> 6506 ogando 20 0 13024 1820 992 S 0 0.0 0:00.02 x
> 6527 ogando 20 0 12740 1456 996 S 0 0.0 0:00.02 lapw0para
> 6669 ogando 20 0 74180 3632 2236 S 0 0.0 0:00.09 mpirun
> 17182 ogando 20 0 13308 1892 1060 S 0 0.0 0:00.13 csh
> 17183 ogando 20 0 10364 656 396 S 0 0.0 0:00.40 pbs_demux
> 17203 ogando 20 0 12932 1720 1008 S 0 0.0 0:00.07 csh
>
>
> r1i1n1
>
> top - 17:40:46 up 12 days, 9 min, 2 users, load average: 10.55, 4.34, 1.74
> Tasks: 242 total, 13 running, 229 sleeping, 0 stopped, 0 zombie
> Cpu(s):100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 36176M total, 36080M used, 96M free, 0M buffers
> Swap: 0M total, 0M used, 0M free, 34456M cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 27446 ogando 20 0 516m 65m 9368 R 100 0.2 1:34.78 lapw0_mpi
> 27447 ogando 20 0 517m 66m 9432 R 100 0.2 1:35.16 lapw0_mpi
> 27448 ogando 20 0 516m 65m 9412 R 100 0.2 1:34.88 lapw0_mpi
> 27449 ogando 20 0 516m 65m 9464 R 100 0.2 1:33.37 lapw0_mpi
> 27450 ogando 20 0 515m 65m 9440 R 100 0.2 1:33.96 lapw0_mpi
> 27453 ogando 20 0 516m 65m 9480 R 100 0.2 1:35.44 lapw0_mpi
> 27454 ogando 20 0 515m 65m 9424 R 100 0.2 1:35.85 lapw0_mpi
> 27455 ogando 20 0 516m 65m 9452 R 100 0.2 1:34.47 lapw0_mpi
> 27456 ogando 20 0 516m 65m 9440 R 100 0.2 1:34.78 lapw0_mpi
> 27457 ogando 20 0 516m 65m 9420 R 100 0.2 1:30.90 lapw0_mpi
> 27451 ogando 20 0 517m 65m 9472 R 100 0.2 1:34.65 lapw0_mpi
> 27452 ogando 20 0 516m 65m 9436 R 100 0.2 1:33.63 lapw0_mpi
> 27445 ogando 20 0 67540 3336 2052 S 0 0.0 0:00.11 orted
>
> r1i1n2
>
> top - 17:42:30 up 221 days, 6:29, 1 user, load average: 10.76, 9.59, 8.79
> Tasks: 242 total, 13 running, 229 sleeping, 0 stopped, 0 zombie
> Cpu(s): 7.5%us, 0.1%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 36176M total, 31464M used, 4712M free, 0M buffers
> Swap: 0M total, 0M used, 0M free, 10563M cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2096 ogando 20 0 927m 642m 20m R 9 1.8 0:09.30 lapw1c_mpi
> 2109 ogando 20 0 926m 633m 17m R 9 1.8 0:14.58 lapw1c_mpi
> 2122 ogando 20 0 924m 633m 19m R 9 1.8 0:09.65 lapw1c_mpi
> 2124 ogando 20 0 922m 627m 15m R 9 1.7 0:06.72 lapw1c_mpi
> 2108 ogando 20 0 927m 633m 17m R 8 1.8 0:09.04 lapw1c_mpi
> 2110 ogando 20 0 926m 633m 17m R 8 1.7 0:09.01 lapw1c_mpi
> 2111 ogando 20 0 924m 627m 13m R 8 1.7 0:14.56 lapw1c_mpi
> 2095 ogando 20 0 930m 641m 17m R 8 1.8 0:09.32 lapw1c_mpi
> 2121 ogando 20 0 927m 634m 17m R 8 1.8 0:06.76 lapw1c_mpi
> 2123 ogando 20 0 924m 632m 18m R 8 1.7 0:09.65 lapw1c_mpi
> 2098 ogando 20 0 922m 634m 16m R 8 1.8 0:06.71 lapw1c_mpi
> 2097 ogando 20 0 927m 641m 19m R 7 1.8 0:06.75 lapw1c_mpi
> 2094 ogando 20 0 67048 2928 2052 S 0 0.0 0:00.02 orted
> 2099 ogando 20 0 67048 2932 2052 S 0 0.0 0:00.01 orted
> 2120 ogando 20 0 67048 2924 2052 S 0 0.0 0:00.01 orted
>
> r1i1n3
>
> top - 17:42:50 up 56 days, 3:25, 1 user, load average: 10.57, 6.02, 2.59
> Tasks: 241 total, 13 running, 228 sleeping, 0 stopped, 0 zombie
> Cpu(s): 99.5%us, 0.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 36176M total, 31395M used, 4781M free, 0M buffers
> Swap: 0M total, 0M used, 0M free, 23089M cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 26197 ogando 20 0 913m 634m 17m R 100 1.8 0:59.33 lapw1c_mpi
> 26199 ogando 20 0 909m 630m 16m R 100 1.7 0:59.39 lapw1c_mpi
> 26200 ogando 20 0 906m 624m 13m R 100 1.7 0:59.35 lapw1c_mpi
> 26260 ogando 20 0 910m 631m 16m R 100 1.7 0:57.17 lapw1c_mpi
> 26271 ogando 20 0 904m 625m 17m R 100 1.7 0:54.98 lapw1c_mpi
> 26273 ogando 20 0 903m 624m 16m R 100 1.7 0:55.03 lapw1c_mpi
> 26274 ogando 20 0 903m 620m 13m R 100 1.7 0:55.04 lapw1c_mpi
> 26258 ogando 20 0 913m 634m 17m R 100 1.8 0:57.17 lapw1c_mpi
> 26259 ogando 20 0 910m 631m 16m R 100 1.7 0:57.16 lapw1c_mpi
> 26261 ogando 20 0 908m 625m 13m R 100 1.7 0:57.22 lapw1c_mpi
> 26272 ogando 20 0 903m 624m 17m R 100 1.7 0:54.97 lapw1c_mpi
> 26198 ogando 20 0 909m 631m 17m R 99 1.7 0:59.34 lapw1c_mpi
> 26196 ogando 20 0 67048 2924 2052 S 0 0.0 0:00.02 orted
> 26257 ogando 20 0 67048 2928 2052 S 0 0.0 0:00.01 orted
> 26270 ogando 20 0 67048 2924 2052 S 0 0.0 0:00.01 orted
>
> 2015-09-21 10:22 GMT-03:00 Peter Blaha <pblaha at theochem.tuwien.ac.at>:
>
> a) Check your .machines file. Does it meet your expectations, or does
> this node have too large a load?
>
> b) Can you interactively log in to these nodes while your job is
> running?
> If yes, log in on 2 nodes (in two windows) and run top.
>
> c) If nothing obvious is wrong so far, test the network by copying
> some larger files between these nodes and your $home (or $scratch)
> to see if file I/O is killing you.
>
>
> On 09/21/2015 02:51 PM, Luis Ogando wrote:
>
> Dear Prof. Marks,
>
> Many thanks for your help.
> The administrators said that everything is OK and the software is
> the problem (the easy answer): no zombies, no other jobs on the
> node, ... !!
> Let me give you more information to see if you can imagine other
> possibilities:
>
> 1) Intel Xeon Six Core 5680, 3.33GHz
>
> 2) Intel(R) Fortran/CC/OpenMPI Intel(R) 64 Compiler XE for
> applications
> running on Intel(R) 64, Version 12.1.1.256 Build 20111011
>
> 3) OpenMPI 1.6.5
>
> 4) PBS Pro 11.0.2
>
> 5) OpenMPI built using --with-tm due to prohibited ssh among
> nodes (
> http://www.open-mpi.org/faq/?category=building#build-rte-tm )
>
> 6) Wien2k 14.2
>
> 7) The mystery: two weeks ago, everything was working properly!!
>
> Many thanks again !
> All the best,
> Luis
>
> 2015-09-18 23:24 GMT-03:00 Laurence Marks <laurence.marks at gmail.com>:
>
> Almost certainly one or more of:
> * Other jobs on the node
> * Zombie process(es)
> * Too many MPI processes
> * Bad memory
> * Full disc
> * Too hot
>
> If you have it, use ganglia; if not, ssh in and use top/ps or
> whatever SGI has. If you cannot sudo, get help from someone who can.
>
> On Sep 18, 2015 8:58 PM, "Luis Ogando" <lcodacal at gmail.com> wrote:
>
> Dear Wien2k community,
>
> I am using Wien2k on an SGI cluster with 32 nodes. My
> calculation is running on 4 nodes that have the same
> characteristics, and only my job is running on these 4 nodes.
> I noticed that one of these 4 nodes is spending more than 20
> times the time spent by the other 3 nodes in the run_lapw
> execution.
> Could someone imagine a reason for this? Any advice?
> All the best,
> Luis
>
>
>
>
>
> --
>
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300    FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
> WWW: http://www.imc.tuwien.ac.at/staff/tc_group_e.php
> --------------------------------------------------------------------------
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------
More information about the Wien mailing list