[Wien] time difference among nodes
Peter Blaha
pblaha at theochem.tuwien.ac.at
Wed Sep 23 08:32:54 CEST 2015
Of course, at the "same time" ONLY lapw0_mpi OR lapw1_mpi should be
running.
However, I assume you took these "top" snapshots sequentially, one after
the other? And of course, in an SCF cycle, lapw1 will start a few minutes
after lapw0 has been running.
Do these tests in several windows in parallel.
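For example (a minimal sketch, assuming you can still ssh from the login
node into the compute nodes; the node names are taken from your output,
adapt as needed):

   # grab one "top" snapshot from every node at (roughly) the same time
   for node in r1i1n0 r1i1n1 r1i1n2 r1i1n3 ; do
      ssh $node "top -b -n 1 | head -30" > top_$node.log &
   done
   wait

Otherwise, simply open one terminal per node and watch top interactively.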
The only suspicious info is the memory consumption. On the slow node you
see:
> Mem: 36176M total, 8820M used, 27355M free,
on the fast one:
> Mem: 36176M total, 36080M used, 96M free,
It may indicate that the slow node has a different configuration; in
particular, it does not seem to buffer I/O but keeps only the running
programs (12 x 500 MB) in memory. The fast one uses "all" of its memory,
and typically this is used by the operating system to hold various
daemons and buffers permanently in memory.
The latter behavior is what I normally see on my nodes and what should
be the default.
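To compare the configuration of the two nodes, something like the
following may help (a minimal sketch; run it on the slow and on a fast
node and compare the numbers):

   free -m                                    # total/used/free/buffers/cached
   grep -E 'MemTotal|MemFree|Buffers|^Cached|Dirty' /proc/meminfo
   sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio

If the buffer/cache related settings differ, the administrators should
adjust the slow node to match the others.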
On 09/22/2015 10:08 PM, Luis Ogando wrote:
> Trying to decrease the size of a previous message!!!
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Dear Prof. Blaha and Marks,
>
> Please find below the "top" output for my calculation.
> As you can see, there is a huge difference in CPU use on the r1i1n2
> node (the problematic one). What could be the reason? What can I do?
> We also have the first two nodes executing lapw0_mpi while the other
> two are executing lapw1c_mpi. Is this normal?
> Thank you again,
> Luis
> ================================================================================================
>
> r1i1n0
>
> top - 17:41:29 up 11 days, 8:49, 2 users, load average: 10.95, 4.99, 2.01
> Tasks: 248 total, 13 running, 235 sleeping, 0 stopped, 0 zombie
> Cpu(s): 99.9%us, 0.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 36176M total, 8820M used, 27355M free, 0M buffers
> Swap: 0M total, 0M used, 0M free, 7248M cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 6670 ogando 20 0 517m 70m 14m R 100 0.2 2:22.27 lapw0_mpi
> 6671 ogando 20 0 511m 71m 19m R 100 0.2 2:22.57 lapw0_mpi
> 6672 ogando 20 0 512m 67m 15m R 100 0.2 2:22.26 lapw0_mpi
> 6673 ogando 20 0 511m 69m 18m R 100 0.2 2:22.49 lapw0_mpi
> 6674 ogando 20 0 511m 64m 13m R 100 0.2 2:22.69 lapw0_mpi
> 6675 ogando 20 0 511m 67m 16m R 100 0.2 2:22.63 lapw0_mpi
> 6676 ogando 20 0 511m 63m 12m R 100 0.2 2:22.24 lapw0_mpi
> 6677 ogando 20 0 511m 62m 11m R 100 0.2 2:22.59 lapw0_mpi
> 6679 ogando 20 0 511m 67m 16m R 100 0.2 2:22.20 lapw0_mpi
> 6681 ogando 20 0 512m 62m 11m R 100 0.2 2:22.70 lapw0_mpi
> 6678 ogando 20 0 511m 64m 13m R 100 0.2 2:22.64 lapw0_mpi
> 6680 ogando 20 0 510m 62m 12m R 100 0.2 2:22.55 lapw0_mpi
> 924 ogando 20 0 12916 1620 996 S 0 0.0 0:00.28 run_lapw
> 6506 ogando 20 0 13024 1820 992 S 0 0.0 0:00.02 x
> 6527 ogando 20 0 12740 1456 996 S 0 0.0 0:00.02 lapw0para
> 6669 ogando 20 0 74180 3632 2236 S 0 0.0 0:00.09 mpirun
> 17182 ogando 20 0 13308 1892 1060 S 0 0.0 0:00.13 csh
> 17183 ogando 20 0 10364 656 396 S 0 0.0 0:00.40 pbs_demux
> 17203 ogando 20 0 12932 1720 1008 S 0 0.0 0:00.07 csh
>
>
> r1i1n1
>
> top - 17:40:46 up 12 days, 9 min, 2 users, load average: 10.55, 4.34, 1.74
> Tasks: 242 total, 13 running, 229 sleeping, 0 stopped, 0 zombie
> Cpu(s):100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 36176M total, 36080M used, 96M free, 0M buffers
> Swap: 0M total, 0M used, 0M free, 34456M cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 27446 ogando 20 0 516m 65m 9368 R 100 0.2 1:34.78 lapw0_mpi
> 27447 ogando 20 0 517m 66m 9432 R 100 0.2 1:35.16 lapw0_mpi
> 27448 ogando 20 0 516m 65m 9412 R 100 0.2 1:34.88 lapw0_mpi
> 27449 ogando 20 0 516m 65m 9464 R 100 0.2 1:33.37 lapw0_mpi
> 27450 ogando 20 0 515m 65m 9440 R 100 0.2 1:33.96 lapw0_mpi
> 27453 ogando 20 0 516m 65m 9480 R 100 0.2 1:35.44 lapw0_mpi
> 27454 ogando 20 0 515m 65m 9424 R 100 0.2 1:35.85 lapw0_mpi
> 27455 ogando 20 0 516m 65m 9452 R 100 0.2 1:34.47 lapw0_mpi
> 27456 ogando 20 0 516m 65m 9440 R 100 0.2 1:34.78 lapw0_mpi
> 27457 ogando 20 0 516m 65m 9420 R 100 0.2 1:30.90 lapw0_mpi
> 27451 ogando 20 0 517m 65m 9472 R 100 0.2 1:34.65 lapw0_mpi
> 27452 ogando 20 0 516m 65m 9436 R 100 0.2 1:33.63 lapw0_mpi
> 27445 ogando 20 0 67540 3336 2052 S 0 0.0 0:00.11 orted
>
> r1i1n2
>
> top - 17:42:30 up 221 days, 6:29, 1 user, load average: 10.76, 9.59, 8.79
> Tasks: 242 total, 13 running, 229 sleeping, 0 stopped, 0 zombie
> Cpu(s): 7.5%us, 0.1%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 36176M total, 31464M used, 4712M free, 0M buffers
> Swap: 0M total, 0M used, 0M free, 10563M cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2096 ogando 20 0 927m 642m 20m R 9 1.8 0:09.30 lapw1c_mpi
> 2109 ogando 20 0 926m 633m 17m R 9 1.8 0:14.58 lapw1c_mpi
> 2122 ogando 20 0 924m 633m 19m R 9 1.8 0:09.65 lapw1c_mpi
> 2124 ogando 20 0 922m 627m 15m R 9 1.7 0:06.72 lapw1c_mpi
> 2108 ogando 20 0 927m 633m 17m R 8 1.8 0:09.04 lapw1c_mpi
> 2110 ogando 20 0 926m 633m 17m R 8 1.7 0:09.01 lapw1c_mpi
> 2111 ogando 20 0 924m 627m 13m R 8 1.7 0:14.56 lapw1c_mpi
> 2095 ogando 20 0 930m 641m 17m R 8 1.8 0:09.32 lapw1c_mpi
> 2121 ogando 20 0 927m 634m 17m R 8 1.8 0:06.76 lapw1c_mpi
> 2123 ogando 20 0 924m 632m 18m R 8 1.7 0:09.65 lapw1c_mpi
> 2098 ogando 20 0 922m 634m 16m R 8 1.8 0:06.71 lapw1c_mpi
> 2097 ogando 20 0 927m 641m 19m R 7 1.8 0:06.75 lapw1c_mpi
> 2094 ogando 20 0 67048 2928 2052 S 0 0.0 0:00.02 orted
> 2099 ogando 20 0 67048 2932 2052 S 0 0.0 0:00.01 orted
> 2120 ogando 20 0 67048 2924 2052 S 0 0.0 0:00.01 orted
>
> r1i1n3
>
> top - 17:42:50 up 56 days, 3:25, 1 user, load average: 10.57, 6.02, 2.59
> Tasks: 241 total, 13 running, 228 sleeping, 0 stopped, 0 zombie
> Cpu(s): 99.5%us, 0.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 36176M total, 31395M used, 4781M free, 0M buffers
> Swap: 0M total, 0M used, 0M free, 23089M cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 26197 ogando 20 0 913m 634m 17m R 100 1.8 0:59.33 lapw1c_mpi
> 26199 ogando 20 0 909m 630m 16m R 100 1.7 0:59.39 lapw1c_mpi
> 26200 ogando 20 0 906m 624m 13m R 100 1.7 0:59.35 lapw1c_mpi
> 26260 ogando 20 0 910m 631m 16m R 100 1.7 0:57.17 lapw1c_mpi
> 26271 ogando 20 0 904m 625m 17m R 100 1.7 0:54.98 lapw1c_mpi
> 26273 ogando 20 0 903m 624m 16m R 100 1.7 0:55.03 lapw1c_mpi
> 26274 ogando 20 0 903m 620m 13m R 100 1.7 0:55.04 lapw1c_mpi
> 26258 ogando 20 0 913m 634m 17m R 100 1.8 0:57.17 lapw1c_mpi
> 26259 ogando 20 0 910m 631m 16m R 100 1.7 0:57.16 lapw1c_mpi
> 26261 ogando 20 0 908m 625m 13m R 100 1.7 0:57.22 lapw1c_mpi
> 26272 ogando 20 0 903m 624m 17m R 100 1.7 0:54.97 lapw1c_mpi
> 26198 ogando 20 0 909m 631m 17m R 99 1.7 0:59.34 lapw1c_mpi
> 26196 ogando 20 0 67048 2924 2052 S 0 0.0 0:00.02 orted
> 26257 ogando 20 0 67048 2928 2052 S 0 0.0 0:00.01 orted
> 26270 ogando 20 0 67048 2924 2052 S 0 0.0 0:00.01 orted
>
> 2015-09-21 10:22 GMT-03:00 Peter Blaha <pblaha at theochem.tuwien.ac.at>:
>
> a) Check your .machines file. Does it meet your expectations, or does
> this node have too large a load?
>
> b) Can you interactively log in to these nodes while your job is
> running?
> If yes, log in on 2 nodes (in two windows) and run top.
>
> c) If nothing obvious is wrong so far, test the network by copying
> some larger files between these nodes and your $home (or $scratch)
> to see if file I/O is killing you.
>
>
> On 09/21/2015 02:51 PM, Luis Ogando wrote:
>
> Dear Prof. Marks,
>
> Many thanks for your help.
> The administrators said that everything is OK and the software is
> the problem (the easy answer): no zombies, no other jobs on the
> node, ... !!
> Let me give you more information to see if you can imagine other
> possibilities:
>
> 1) Intel Xeon Six Core 5680, 3.33GHz
>
> 2) Intel(R) Fortran/CC/OpenMPI Intel(R) 64 Compiler XE for
> applications
> running on Intel(R) 64, Version 12.1.1.256 Build 20111011
>
> 3) OpenMPI 1.6.5
>
> 4) PBS Pro 11.0.2
>
> 5) OpenMPI built using --with-tm due to prohibited ssh among
> nodes (
> http://www.open-mpi.org/faq/?category=building#build-rte-tm )
>
> 6) Wien2k 14.2
>
> 7) The mystery: two weeks ago, everything was working properly!!
>
> Many thanks again !
> All the best,
> Luis
>
> 2015-09-18 23:24 GMT-03:00 Laurence Marks <laurence.marks at gmail.com>:
>
> Almost certainly one or more of:
> * Other jobs on the node
> * Zombie process(es)
> * Too many MPI processes
> * Bad memory
> * Full disc
> * Too hot
>
> If you have it, use ganglia; if not, ssh in and use top/ps or
> whatever SGI has. If you cannot sudo, get help from someone who can.
>
> On Sep 18, 2015 8:58 PM, "Luis Ogando" <lcodacal at gmail.com> wrote:
>
> Dear Wien2k community,
>
> I am using Wien2k on an SGI cluster with 32 nodes. My
> calculation is running on 4 nodes that have the same
> characteristics, and only my job is running on these 4 nodes.
> I noticed that one of these 4 nodes is spending more than 20
> times the time spent by the other 3 nodes in the run_lapw
> execution.
> Could someone imagine a reason for this? Any advice?
> All the best,
> Luis
>
>
>
>
>
> --
>
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300    FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
> WWW: http://www.imc.tuwien.ac.at/staff/tc_group_e.php
> --------------------------------------------------------------------------
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------
More information about the Wien mailing list