[Wien] Problem with wien2k 13.1 parallel for Slurm+intel mpi

Natalia Pavlenko natalia.pavlenko at physik.uni-augsburg.de
Sat Nov 23 12:54:07 CET 2013


Dear users,

I have a problem running Wien2k 13.1 in parallel on a cluster
that uses Slurm with Intel MPI.
For a test run on 1 node with 6 CPU cores, I
generated the following .machines file:

-------.machines
#
lapw0:alcc69
1:alcc69:6
granularity:1
extrafine:1
---------------------------------
and used the following command in the script:
srun -n 6  runsp_lapw -NI -cc 0.0001 -i 50
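
For reference, a single-node .machines file with the layout shown above could be generated in the job script along these lines. This is only a minimal sketch: the function name make_machines is made up for illustration, and the host name and core count are passed in by hand rather than taken from Slurm.

```shell
#!/bin/sh
# Hypothetical helper (not part of Wien2k): print a single-node .machines
# file with the same five lines as the example in this post.
# Arguments: $1 = host name, $2 = number of cores on that host.
make_machines() {
    host=$1
    ncores=$2
    printf '#\n'
    printf 'lapw0:%s\n' "$host"
    printf '1:%s:%s\n' "$host" "$ncores"
    printf 'granularity:1\n'
    printf 'extrafine:1\n'
}

# Values from the test run described above; in a real job the output
# would be redirected to .machines in the case directory.
make_machines alcc69 6
```

Printing to standard output keeps the helper free of side effects, so the result can be inspected before it is written to .machines.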

In the first cycle, lapw0, lapw1 and lapw2 finish
successfully, but after that lcore and mixer continue to run in parallel mode.
They overlap with lapw0 from the second cycle and cause a crash,
as can be seen in the output in case.dayfile:

--------------------------------------------------------------
     cycle 1     (Fri Nov 22 15:32:51 CET 2013)  (50/99 to go)

>   lapw0       (15:32:51) >   lapw0    (15:32:51) >   lapw0    (15:32:51) >   lapw0    (15:32:51) >   lapw0     (15:32:51) >   lapw0    (15:32:51) 44.798u 0.244s 0:45.75 98.4% 0+0k 0+0io 0pf+0w
>   lapw1  -up          (15:33:37)
>   lapw1  -up          (15:33:38)
>   lapw1  -up          (15:33:38)
>   lapw1  -up          (15:33:39)
>   lapw1  -up          (15:33:39)
>   lapw1  -up          (15:33:39)  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
>   lapw1  -dn          (16:12:48)
>   lapw1  -dn          (16:13:25)
>   lapw1  -dn          (16:13:29)
>   lapw1  -dn          (16:13:30)
>   lapw1  -dn          (16:13:42)
>   lapw1  -dn          (16:13:47)  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
  _nb in dscgst.F         512         128
>   lapw2 -up           (17:07:01)
>   lapw2 -up           (17:07:57)
>   lapw2 -up           (17:08:44)
>   lapw2 -up           (17:08:52)
>   lapw2 -dn           (17:09:00)
>   lapw2 -up           (17:09:01)
>   lapw2 -up           (17:09:02)
>   lapw2 -dn           (17:09:52)
>   lapw2 -dn           (17:10:40)
>   lapw2 -dn           (17:10:56)
>   lapw2 -dn           (17:11:03)
>   lapw2 -dn           (17:11:13)
>   lcore -up   (17:11:40) 0.124u 0.024s 0:00.33 42.4%  0+0k 0+0io 0pf+0w
>   lcore -dn   (17:11:41) 0.120u 0.024s 0:00.30 46.6%  0+0k 0+0io 0pf+0w
>   mixer       (17:11:42) 0.172u 0.092s 0:00.58 44.8%  0+0k 0+0io 0pf+0w
error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def   failed

>   stop error
>   lcore -up   (17:12:15) 0.132u 0.012s 0:00.20 70.0%  0+0k 0+0io 0pf+0w
>   lcore -dn   (17:12:15) 0.128u 0.012s 0:00.20 65.0%  0+0k 0+0io 0pf+0w
>   mixer       (17:11:42) 0.172u 0.092s 0:00.58 44.8%  0+0k 0+0io 0pf+0w
error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def   failed

>   stop error
>   lcore -up   (17:12:15) 0.132u 0.012s 0:00.20 70.0%  0+0k 0+0io 0pf+0w
>   lcore -dn   (17:12:15) 0.128u 0.012s 0:00.20 65.0%  0+0k 0+0io 0pf+0w
>   mixer       (17:12:16) 0.680u 0.132s 0:02.28 35.5%  0+0k 0+0io 0pf+0w
:ENERGY convergence:  0 0 0
:CHARGE convergence:  0 0.0001 0

     cycle 2     (Fri Nov 22 17:12:18 CET 2013)  (49/98 to go)

>   lapw0       (17:12:18)
>   lcore -up   (17:12:58) 0.000u 0.008s 0:00.00 0.0%   0+0k 0+0io 0pf+0w
error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def  failed

>   stop error
>   lcore -up   (17:13:02) 0.000u 0.008s 0:00.00 0.0%   0+0k 0+0io 0pf+0w
error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def  failed

>   stop error

>   stop error
------------------------------------------------------------------------------
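
To see how often each step was started, the step markers in the dayfile can be counted. The following is a rough sketch that assumes the dayfile marks each started step as ">" followed by spaces and the program name, as in the excerpt above. The function name count_step is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count how many times a step was started,
# judging by the ">   <step>" markers in a dayfile.
# Arguments: $1 = dayfile, $2 = step name (e.g. lapw0, lcore, mixer).
count_step() {
    # -o prints every match on its own line, so several markers
    # written onto the same line are each counted.
    grep -E -o "> +$2" "$1" | wc -l | tr -d ' '
}

# Report the counts for the steps that appear in the excerpt above.
# The file name defaults to case.dayfile, the name used in this post.
dayfile=${1:-case.dayfile}
if [ -f "$dayfile" ]; then
    for step in lapw0 lapw1 lapw2 lcore mixer; do
        printf '%s: %s\n' "$step" "$(count_step "$dayfile" "$step")"
    done
fi
```

The counts cover the whole file, so a dayfile with several cycles would need to be split per cycle before the numbers are compared.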

It looks like the .machines file needs some additional details about
the calculation mode for lcore and mixer. How should the .machines file
be configured properly in this case?


Best regards, N.Pavlenko


-- 
Dr. Natalia Pavlenko
Institute of Physics, University of Augsburg
Universitätstr.1, 86135 Augsburg
Tel.: 0821-5983664
Fax: 0821-5983652

