[Wien] time difference among nodes

Wed Sep 23 10:09:00 CEST 2015

22.09.2015 23:08, Luis Ogando wrote:
> r1i1n1 -------------
> top - 17:40:46 up 12 days, 9 min,  2 users,  load average: 10.55, 4.34, 1.74
> Cpu(s):100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
> r1i1n2 -------------
> top - 17:42:30 up 221 days,  6:29,  1 user,  load average: 10.76, 9.59, 8.79
> Cpu(s):  7.5%us,  0.1%sy,  0.0%ni, 92.4%id,  0.0%wa,  0.0%hi,  0.0%si,
> r1i1n3 -------------
> top - 17:42:50 up 56 days,  3:25,  1 user,  load average: 10.57, 6.02, 2.59
> Cpu(s): 99.5%us,  0.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,

1) The first difference I see is that the node in question has not been 
restarted for 221 days. I would start by rebooting it (the problem may 
disappear, and you may never learn what caused it).
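
To compare uptimes across many nodes without reading each header by eye, 
something like the following could be used. It is only a sketch: the helper 
name uptime_days is my own invention, and it assumes the header has the same 
layout as the top output quoted above.

```shell
# Print the uptime in whole days from a top/uptime header line on stdin.
# Prints 0 when the node has been up for less than a day.
# (uptime_days is a hypothetical helper name, not a standard tool.)
uptime_days() {
    awk '{
        days = 0
        for (i = 1; i <= NF - 2; i++)
            if ($i == "up" && $(i + 2) ~ /^days?,$/) days = $(i + 1)
        print days
    }'
}

# Possible usage, with the node names from the top output above:
# for node in r1i1n1 r1i1n2 r1i1n3; do
#     printf '%s: ' "$node"; ssh "$node" uptime | uptime_days
# done
```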

2) You didn't check:
 >         2015-09-18 23:24 GMT-03:00 Laurence Marks
 >              * Bad memory
 >              * Full disc

Run "df" on n2 and on one of the other nodes for comparison, and send the 
output. Also check which working directory the nodes use: there should be 
something like "export SCRATCH=./" in .bashrc. Run "set > aaa" and look 
up the variable SCRATCH in the file aaa, then compare it with the output 
of df.
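
The disc check could be scripted roughly as below. This is a sketch under 
assumptions: full_filesystems is a hypothetical helper name, the 90 % limit 
is arbitrary, and the commented loop assumes passwordless ssh to the nodes 
and that a login shell reads the same .bashrc as the job does.

```shell
# Print mount point and usage for every filesystem at or above a limit.
# Reads `df -P` output on stdin; $1 is the limit in percent (default 90).
# (full_filesystems is a hypothetical helper name.)
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                      # skip the df header line
            use = $5
            sub(/%$/, "", use)        # "97%" -> "97"
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

# Possible usage, with the node names from the top output above:
# for node in r1i1n1 r1i1n2 r1i1n3; do
#     echo "== $node"
#     ssh "$node" 'df -P' | full_filesystems 90
#     ssh "$node" 'bash -lc "echo SCRATCH=\$SCRATCH"'
# done
```

If SCRATCH is "./", the scratch files land in the directory the job was 
started from, so that is the filesystem to look for in the df output.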

3) Just to be sure: you showed us top only for the user ogando. I hope you 
really checked that no other users are running anything (in top on n2 
press "u" and leave the "Which user (blank for all)" prompt blank). The 
header says "1 user", but that counts only login sessions; root, syslog, 
statd and so forth will own processes as well.
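
A non-interactive way to see whether anyone else is loading the node is to 
sum the CPU share per user. The following is a sketch only: cpu_by_user is 
a hypothetical helper name, and note that the pcpu value reported by ps is 
averaged over each process's lifetime, so it is not the same instantaneous 
figure that top shows.

```shell
# Sum %CPU per user. Reads "user pcpu" pairs on stdin, as printed by
# `ps -e -o user= -o pcpu=`, and prints the totals, largest first.
# (cpu_by_user is a hypothetical helper name.)
cpu_by_user() {
    LC_ALL=C awk '
        { cpu[$1] += $2 }
        END { for (u in cpu) printf "%s %.1f\n", u, cpu[u] }
    ' | LC_ALL=C sort -k2,2nr
}

# Possible usage on the slow node:
# ssh r1i1n2 'ps -e -o user= -o pcpu=' | cpu_by_user
```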

 >    We also have the first two nodes executing lapw0_mpi while the other
 > two are executing lapw1c_mpi. Is this normal ?

I do not know. It looks suspicious but, IMHO, it is not connected with 
the problem under discussion.

Best wishes
   Lyudmila Dobysheva

>     On 09/21/2015 02:51 PM, Luis Ogando wrote:
>         7) The mystery : two weeks ago, everything was working properly !!
>              On Sep 18, 2015 8:58 PM, "Luis Ogando" wrote:
>                      I am using Wien2k in a SGI cluster with 32 nodes. My
>                  calculation is running in 4 nodes that have the same
>                  characteristics and only my job is running in these 4
>                  nodes.
>                      I noticed that one of these 4 nodes is spending
>                  more than 20 times the time spent by the other 3 nodes
>                  in the run_lapw execution.
>                      Could someone imagine a reason for this ? Any advice ?
------------------------------------------------------------------
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
RUSSIA
------------------------------------------------------------------
Tel.:7(3412) 432045(office), 722529(Fax)
E-mail: lyu at ftiudm.ru, lyuka17 at mail.ru (office)
         lyuka17 at gmail.com (home)
Skype:  lyuka17 (home), lyuka18 (office)
http://ftiudm.ru/content/view/25/103/lang,english/
------------------------------------------------------------------