[Wien] time difference among nodes

Peter Blaha pblaha at theochem.tuwien.ac.at
Mon Sep 21 15:22:11 CEST 2015


a) Check your .machines file. Does it meet your expectations, or does 
this node carry too large a load?

b) Can you log in interactively to these nodes while your job is running?
If yes, log in on 2 nodes (in two windows) and run    top

c) If nothing obvious is wrong so far, test the network by copying some 
bigger files between these nodes and your $home (or $scratch) to see 
if file I/O is killing you.
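
A minimal sketch of the test in c), assuming a POSIX shell with dd, cp and
date +%s available on the nodes. The function name io_test and the default
size of 256 MB are illustrative assumptions, not part of WIEN2k. It writes
a test file, copies it within the same directory, and prints the elapsed
seconds so that the slow node can be compared with a fast one.

```shell
#!/bin/sh
# Hypothetical helper (not part of WIEN2k): time writing a test file and
# copying it inside a directory, then print node, directory, size and seconds.
# Usage: io_test DIR [SIZE_MB]
io_test() {
    dir=${1:-.}
    size_mb=${2:-256}              # assumed default size; adjust as needed
    f="$dir/io_test.$$"            # scratch file name based on the shell PID
    start=$(date +%s)
    # Write size_mb blocks of 1 MiB, then copy the file within the same directory.
    if dd if=/dev/zero of="$f" bs=1048576 count="$size_mb" 2>/dev/null &&
       cp "$f" "$f.copy"; then
        sync                       # flush buffered data so the timing includes it
        end=$(date +%s)
        status=0
        echo "$(uname -n) $dir ${size_mb}MB $((end - start))s"
    else
        status=1
        echo "io_test: write or copy failed in $dir" >&2
    fi
    rm -f "$f" "$f.copy"           # always clean up the test files
    return $status
}
```

Run it on each node, for example as   io_test "$HOME"   and
io_test "$SCRATCH"   (the variable names depend on your site), and compare
the reported seconds. A node that is far slower than the others here points
to file I/O rather than to WIEN2k itself.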


On 09/21/2015 02:51 PM, Luis Ogando wrote:
> Dear Prof. Marks,
>
>     Many thanks for your help.
>     The administrators said that everything is OK, the software is the
> problem (the easy answer) : no zombies, no other jobs in the node, ... !!
>     Let me give you more information to see if you can imagine other
> possibilities:
>
> 1) Intel Xeon Six Core 5680, 3.33GHz
>
> 2) Intel(R) Fortran/CC/OpenMPI Intel(R) 64 Compiler XE for applications
> running on Intel(R) 64, Version 12.1.1.256 Build 20111011
>
> 3) OpenMPI 1.6.5
>
> 4) PBS Pro 11.0.2
>
> 5) OpenMPI built using  --with-tm  due to prohibited ssh among nodes  (
> http://www.open-mpi.org/faq/?category=building#build-rte-tm )
>
> 6) Wien2k 14.2
>
> 7) The mystery : two weeks ago, everything was working properly !!
>
>     Many thanks again !
>     All the best,
>                     Luis
>
> 2015-09-18 23:24 GMT-03:00 Laurence Marks <laurence.marks at gmail.com
> <mailto:laurence.marks at gmail.com>>:
>
>     Almost certainly one or more of:
>     * Other jobs on the node
>     * Zombie process(es)
>     * Too many MPI processes
>     * Bad memory
>     * Full disc
>     * Too hot
>
>     If you have it, use ganglia; if not, ssh in and use top/ps or whatever
>     SGI has. If you cannot sudo, get help from someone who can.
>
>     On Sep 18, 2015 8:58 PM, "Luis Ogando" <lcodacal at gmail.com
>     <mailto:lcodacal at gmail.com>> wrote:
>
>         Dear Wien2k community,
>
>             I am using Wien2k on an SGI cluster with 32 nodes. My
>         calculation is running on 4 nodes that have the same
>         characteristics, and only my job is running on these 4 nodes.
>             I noticed that one of these 4 nodes is spending more than 20
>         times the time spent by the other 3 nodes in the run_lapw execution.
>             Can anyone think of a reason for this? Any advice?
>             All the best,
>                      Luis
>
>
>     _______________________________________________
>     Wien mailing list
>     Wien at zeus.theochem.tuwien.ac.at <mailto:Wien at zeus.theochem.tuwien.ac.at>
>     http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>     SEARCH the MAILING-LIST at:
>     http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------

