[Wien] Problem with wien2k 13.1 parallel for Slurm+intel mpi

Peter Blaha pblaha at theochem.tuwien.ac.at
Sat Nov 23 13:07:29 CET 2013


You completely misunderstand how parallelization in wien2k works.
Please read the parallelization chapter of the UG carefully, and note the
k-parallel and mpi-parallel options and the cases in which each is useful.

I'm not familiar with "Slurm", but it looks as if you ran the same
sequential job 6 times in parallel, with each copy overwriting the files
generated by the others.

> I have a problem with parallel run of Wien2k 13.1 on a cluster
> with Slurm Environment+ Intel mpi.
> In a test run for 1 node with 6 cpu cores, I
> generated the following .machines file:
>
> -------.machines
> #
> lapw0:alcc69
> 1:alcc69:6
> granularity:1
> extrafine:1

This is ok, except for the lapw0 line, which as written would not run in
parallel. Use

lapw0:alcc69:6
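
The single-node .machines layout discussed above can also be generated
instead of typed by hand. The following is only a sketch: make_machines is a
hypothetical helper name (it is not part of wien2k), and the host name and
core count are simply the values from the quoted example.

```shell
# Sketch: print a single-node .machines file in the layout from this thread.
# make_machines is a hypothetical helper, not a wien2k command.
# $1 = host name, $2 = number of cores on that host
make_machines() {
    printf '#\n'
    printf 'lapw0:%s:%s\n' "$1" "$2"   # lapw0 mpi-parallel on $2 cores
    printf '1:%s:%s\n' "$1" "$2"       # one mpi-parallel job on $2 cores
    printf 'granularity:1\n'
    printf 'extrafine:1\n'
}

# Example with the values used above (prints to stdout)
make_machines alcc69 6
```

Inside a job script the output would be redirected into the case directory,
for example with: make_machines "$(hostname)" 6 > .machines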

> and  used the following command in the script:
> srun -n 6  runsp_lapw -NI -cc 0.0001 -i 50

This runs "runsp_lapw ..." 6 times.

wien2k spawns its parallel processes itself (provided you have properly
installed wien2k and specified the proper "mpirun ..." command during
siteconfig), but you must add the -p flag.

So the single command

runsp_lapw -NI -cc 0.0001 -i 50 -p

should itself start 6 parallel processes (mpi-parallel, given your
.machines file). You only need to have permission to do so.
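
As a rough sketch of what the job script could then look like (untested with
Slurm here; the #SBATCH resource lines are assumptions about your site, and
scf_command is a hypothetical helper that only prints the command line):

```shell
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6

# Assumption: the case directory already contains a suitable .machines file.

# scf_command is a hypothetical helper; it only prints the command line.
scf_command() {
    echo "runsp_lapw -NI -cc 0.0001 -i 50 -p"
}

if command -v runsp_lapw >/dev/null 2>&1; then
    # Start the SCF driver exactly once, without "srun -n 6" in front;
    # with -p the wien2k scripts start the parallel processes themselves.
    $(scf_command)
else
    # wien2k is not in PATH: only show what would be started.
    echo "would run: $(scf_command)"
fi
```

The point of the sketch is only that runsp_lapw appears once and is not
wrapped in srun; everything else depends on the local installation.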

>
> In the first cycle, the lapw0,lapw1 and lapw2 are successfully
> finished, but after that lcore and mixer continue to run in parallel mode,
> they intermix with lapw0 from the second cycle and cause a crash,
> which can be seen from the output in case.dayfile:
>
> --------------------------------------------------------------
>      cycle 1     (Fri Nov 22 15:32:51 CET 2013)  (50/99 to go)
>
>>   lapw0       (15:32:51)
>>   lapw0       (15:32:51)
>>   lapw0       (15:32:51)
>>   lapw0       (15:32:51)
>>   lapw0       (15:32:51)
>>   lapw0       (15:32:51) 44.798u 0.244s 0:45.75 98.4% 0+0k 0+0io 0pf+0w
>>   lapw1  -up          (15:33:37)
>>   lapw1  -up          (15:33:38)
>>   lapw1  -up          (15:33:38)
>>   lapw1  -up          (15:33:39)
>>   lapw1  -up          (15:33:39)
>>   lapw1  -up          (15:33:39)  _nb in dscgst.F         512 128
>   _nb in dscgst.F         512         128
>   _nb in dscgst.F         512         128
>   _nb in dscgst.F         512         128
>   _nb in dscgst.F         512         128
>   _nb in dscgst.F         512         128
>>   lapw1  -dn          (16:12:48)
>>   lapw1  -dn          (16:13:25)
>>   lapw1  -dn          (16:13:29)
>>   lapw1  -dn          (16:13:30)
>>   lapw1  -dn          (16:13:42)
>>   lapw1  -dn          (16:13:47)  _nb in dscgst.F         512 128
>   _nb in dscgst.F         512         128
>   _nb in dscgst.F         512         128
>   _nb in dscgst.F         512         128
>   _nb in dscgst.F         512         128
>   _nb in dscgst.F         512         128
>>   lapw2 -up           (17:07:01)
>>   lapw2 -up           (17:07:57)
>>   lapw2 -up           (17:08:44)
>>   lapw2 -up           (17:08:52)
>>   lapw2 -dn           (17:09:00)
>>   lapw2 -up           (17:09:01)
>>   lapw2 -up           (17:09:02)
>>   lapw2 -dn           (17:09:52)
>>   lapw2 -dn           (17:10:40)
>>   lapw2 -dn           (17:10:56)
>>   lapw2 -dn           (17:11:03)
>>   lapw2 -dn           (17:11:13)
>>   lcore -up   (17:11:40) 0.124u 0.024s 0:00.33 42.4%  0+0k 0+0io 0pf+0w
>>   lcore -dn   (17:11:41) 0.120u 0.024s 0:00.30 46.6%  0+0k 0+0io 0pf+0w
>>   mixer       (17:11:42) 0.172u 0.092s 0:00.58 44.8%  0+0k 0+0io 0pf+0w
> error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
>
>>   stop error
>>   lcore -up   (17:12:15) 0.132u 0.012s 0:00.20 70.0%  0+0k 0+0io 0pf+0w
>>   lcore -dn   (17:12:15) 0.128u 0.012s 0:00.20 65.0%  0+0k 0+0io 0pf+0w
>>   mixer       (17:11:42) 0.172u 0.092s 0:00.58 44.8%  0+0k 0+0io 0pf+0w
> error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
>
>>   stop error
>>   lcore -up   (17:12:15) 0.132u 0.012s 0:00.20 70.0%  0+0k 0+0io 0pf+0w
>>   lcore -dn   (17:12:15) 0.128u 0.012s 0:00.20 65.0%  0+0k 0+0io 0pf+0w
>>   mixer       (17:12:16) 0.680u 0.132s 0:02.28 35.5%  0+0k 0+0io 0pf+0w
> :ENERGY convergence:  0 0 0
> :CHARGE convergence:  0 0.0001 0
>
>      cycle 2     (Fri Nov 22 17:12:18 CET 2013)  (49/98 to go)
>
>>   lapw0       (17:12:18)
>>   lcore -up   (17:12:58) 0.000u 0.008s 0:00.00 0.0%   0+0k 0+0io 0pf+0w
> error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
>
>>   stop error
>>   lcore -up   (17:13:02) 0.000u 0.008s 0:00.00 0.0%   0+0k 0+0io 0pf+0w
> error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
>
>>   stop error
> ------------------------------------------------------------------------------
>
> It looks like the .machines file needs some additional details about
> the calculation mode for lcore and mixer. How to configure properly the
> .machines in this case?
>
>
> Best regards, N.Pavlenko
>
>

-- 
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------

