[Wien] Problem with qrsh -V -INHERIT in wien2k 13

cesar cesar at unizar.es
Mon May 4 12:13:56 CEST 2015


Dear all,

Maybe someone can help me with a problem I seem to have with the 
command "qrsh".

I am not sure what is going on with some WIEN2k calculations on one 
64-CPU node of my cluster. The problem is related to the copying of 
the *.def* files to my working directory when lapw1c/2c_mpi runs. 
Sometimes it works fine, but other times it fails, seemingly at 
random. Of course, when I create these files by hand in my directory, 
WIEN2k works fine and I can bypass the problem.
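
To see which files are missing after a failed run, a check like the 
following could be used. It is only a sketch: list_missing_defs is a 
hypothetical helper, not part of WIEN2k, and it assumes the files are 
named PREFIX_1.def ... PREFIX_N.def with contiguous numbers, which may 
not match every setup.

```shell
# list_missing_defs PREFIX COUNT
# Hypothetical helper (not part of WIEN2k): prints every PREFIX_<i>.def,
# i = 1..COUNT, that is absent or empty in the current directory.
# Assumes contiguous numbering, which is an assumption of this sketch.
# Returns 0 if nothing is missing, 1 otherwise.
list_missing_defs() {
    prefix=$1
    count=$2
    i=1
    missing=0
    while [ "$i" -le "$count" ]; do
        f="${prefix}_${i}.def"
        if [ ! -s "$f" ]; then
            echo "$f"
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    [ "$missing" -eq 0 ]
}
```

For example, "list_missing_defs lapw1 48" run in the case directory 
would print the lapw1_*.def files that were not created.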

The administrator of my cluster said that it is not a problem of file 
permissions. Could it be related to network traffic, to some 
re-initialization of qrsh before each execution, or to a 
cluster-specific waiting-time parameter that has to be set in the 
WIEN2k scripts or programs?
I am currently using Open MPI, but I have tried different 
parallelization methods with similar results.
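
To test the waiting-time hypothesis, a small polling function like the 
one below could be called before the parallel step. It is a sketch 
under my own assumptions: wait_for_file is a hypothetical helper, not 
something provided by WIEN2k or SGE, and the 30-second default timeout 
is arbitrary.

```shell
# wait_for_file FILE [TIMEOUT_SECONDS]
# Hypothetical helper (not part of WIEN2k or SGE): polls once per second
# until FILE exists and is non-empty, or until the timeout expires.
# The default timeout of 30 s is an arbitrary choice for this sketch.
wait_for_file() {
    file=$1
    timeout=${2:-30}
    waited=0
    while [ ! -s "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "missing after ${timeout}s: $file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "found after ${waited}s: $file"
}
```

For example, "wait_for_file lapw1_1.def 60" reports how long the file 
took to appear. If the files show up after a few seconds, that would 
point to a file-system or network delay rather than to qrsh itself.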

Does anyone have an idea that I can suggest to my cluster 
administrator? He also told me that the same problem appears with the 
latest WIEN2k version.


As the following lines show, the failure seems to be related to the 
command "qrsh -V -inherit ...".

Sincerely,
César
------------------------------------------------------------------------
The output is as follows:
64 nodes for this job: node046
starting parallel lapw1 at vie may  1 12:16:13 CEST 2015
->  starting parallel LAPW1 jobs at vie may  1 12:16:14 CEST 2015
running LAPW1 in parallel mode (using .machines)
32 number_of_parallel_jobs
[1] 48397
...
...
...
[32] 50043
[2]    Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[2] 50297
[14]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[13]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[11]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[10]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[9]    Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[8]    Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[7]    Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[6]    Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[5]    Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[4]    Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[3]    Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[1]    Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[18]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[12]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[15]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[1] 50399
[3] 50425
[4] 50452
[5] 50483
[17]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[6] 50511
[7] 50556
[16]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[19]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[8] 50663
[9] 50701
[10] 50737
[11] 50777
[12] 50819
[13] 50847
[14] 50893
[15] 50926
[16] 50956
[16]   Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[15] - Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[14] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[13] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[12] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[11] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[10] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[9]  + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[8]  + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[7]  + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[6]  + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[5]  + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[4]  + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[3]  + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[1]  + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[2]  + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[32] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[31] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[30] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[29] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[28] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[27] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[26] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[25] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[24] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[23] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[22] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[21] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[20] + Fin                           ( $remote $remotemachine "cd 
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
      node046 node046(7) /cm/shared/apps/sge/6.2u5p2/bin/lx26-amd64/qrsh 
-V -inherit node046 cd /home/zar21001/WIEN2k/system;time mpirun -np 2 
--mca ras ^gridengine -machinefile .machine1 
/cm/shared/apps/wien2k/wien2k_13.1_big/lapw1c_mpi lapw1_1.def;rm -f 
.lock_node0461
...
...
...
      node046 node046(1) /cm/shared/apps/sge/6.2u5p2/bin/lx26-amd64/qrsh 
-V -inherit node046 cd /home/zar21001/WIEN2k/system;time mpirun -np 2 
--mca ras ^gridengine -machinefile .machine16 
/cm/shared/apps/wien2k/wien2k_13.1_big/lapw1c_mpi lapw1_48.def;rm -f 
.lock_node04616
    Summary of lapw1para:
    node046       k=0     user=240        wallclock=0
**  LAPW1 crashed!
2.048u 32.466s 0:21.62 159.5%   0+0k 808+11600io 2pf+0w
error: command   /cm/shared/apps/wien2k/wien2k_13.1/lapw1cpara -c 
lapw1.def   failed
----------------------------------------------------------

