[Wien] Problem with qrsh -V -INHERIT in wien2k 13
cesar
cesar at unizar.es
Mon May 4 12:13:56 CEST 2015
Dear all,
Maybe someone can help me with a problem I seem to have with the
command "qrsh".
I am not sure what is going on with some wien2k calculations on one
64-CPU node of my cluster. The problem is related to the process
of copying the *.def* files to my working directory when lapw1c/2c_mpi runs.
Sometimes it works fine but other times it fails, seemingly at
random. Of course, when I create these files by hand in my directory,
wien2k works fine and I can bypass the problem.
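To make the check for missing files less manual, something like the following could be used. This is only a minimal sketch: the function name missing_defs is hypothetical, and the .def file names must be passed explicitly, since the numbering depends on the job (the names in the usage line are taken from the output further down).

```shell
#!/bin/sh
# Hypothetical helper (not part of WIEN2k): print every file from the
# given list that is missing or empty in the given directory.
missing_defs() {
    dir=$1
    shift
    for f in "$@"; do
        # -s is true only if the file exists and is not empty
        [ -s "$dir/$f" ] || echo "$f"
    done
}
```

For example, `missing_defs "$PWD" lapw1_1.def lapw1_48.def` prints the names of those two files that are absent from the working directory, and prints nothing if both are there.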
The administrator of my cluster said that it is not a problem of
file permissions. Maybe it is related to the network traffic, or to
some re-initialization of qrsh before any other execution, or to a
cluster-specific waiting-time parameter that has to be set in the wien2k macros
or programs?
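One way to test the waiting-time idea would be to wrap a failing command in a retry loop with a pause between attempts. The sketch below is an assumption for diagnosis only, not something wien2k provides; the function name retry and its arguments (number of attempts, delay in seconds) are hypothetical.

```shell
#!/bin/sh
# Hypothetical diagnostic wrapper (not part of WIEN2k or SGE):
# run a command up to <tries> times, sleeping <delay> seconds
# between attempts, and stop at the first success.
retry() {
    tries=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0                  # command succeeded
        [ "$n" -ge "$tries" ] && return 1 # attempts exhausted
        sleep "$delay"
        n=$((n + 1))
    done
}
```

If the remote command succeeds on a later attempt when called as `retry 3 5 <command>`, that would point to a timing or start-up problem rather than to file permissions.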
At the moment I am using openmpi, but I have tried different parallelization
methods with similar results.
Does anyone have an idea that I can suggest to my cluster administrator?
He also told me that the same problem appears with the latest wien2k
version.
As the following lines show, the failure seems to be related to the command
"qrsh -V -inherit ...".
Sincerely,
César
------------------------------------------------------------------------
The output is the following:
64 nodes for this job: node046
starting parallel lapw1 at vie may 1 12:16:13 CEST 2015
-> starting parallel LAPW1 jobs at vie may 1 12:16:14 CEST 2015
running LAPW1 in parallel mode (using .machines)
32 number_of_parallel_jobs
[1] 48397
...
...
...
[32] 50043
[2] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[2] 50297
[14] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[13] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[11] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[10] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[9] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[8] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[7] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[6] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[5] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[4] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[3] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[1] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[18] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[12] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[15] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[1] 50399
[3] 50425
[4] 50452
[5] 50483
[17] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[6] 50511
[7] 50556
[16] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[19] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[8] 50663
[9] 50701
[10] 50737
[11] 50777
[12] 50819
[13] 50847
[14] 50893
[15] 50926
[16] 50956
[16] Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[15] - Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[14] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[13] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[12] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[11] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[10] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[9] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[8] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[7] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[6] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[5] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[4] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[3] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[1] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[2] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[32] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[31] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[30] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[29] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[28] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[27] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[26] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[25] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[24] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[23] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[22] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[21] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
[20] + Fin ( $remote $remotemachine "cd
$PWD;$t $ttt;rm -f .lock_$lockfile[$p]" ) >> .time1_$loop
node046 node046(7) /cm/shared/apps/sge/6.2u5p2/bin/lx26-amd64/qrsh
-V -inherit node046 cd /home/zar21001/WIEN2k/system;time mpirun -np 2
--mca ras ^gridengine -machinefile .machine1
/cm/shared/apps/wien2k/wien2k_13.1_big/lapw1c_mpi lapw1_1.def;rm -f
.lock_node0461
...
...
...
node046 node046(1) /cm/shared/apps/sge/6.2u5p2/bin/lx26-amd64/qrsh
-V -inherit node046 cd /home/zar21001/WIEN2k/system;time mpirun -np 2
--mca ras ^gridengine -machinefile .machine16
/cm/shared/apps/wien2k/wien2k_13.1_big/lapw1c_mpi lapw1_48.def;rm -f
.lock_node04616
Summary of lapw1para:
node046 k=0 user=240 wallclock=0
** LAPW1 crashed!
2.048u 32.466s 0:21.62 159.5% 0+0k 808+11600io 2pf+0w
error: command /cm/shared/apps/wien2k/wien2k_13.1/lapw1cpara -c
lapw1.def failed