[Wien] Problem with wien2k 13.1 parallel for Slurm+intel mpi
Peter Blaha
pblaha at theochem.tuwien.ac.at
Sat Nov 23 13:07:29 CET 2013
You completely misunderstand how parallelization in wien2k works.
Please read the parallelization chapter of the User's Guide (UG) carefully,
and note the k-parallel and mpi-parallel options and the cases in which
each is useful.
I'm not familiar with "Slurm", but it looks as if you ran the same
sequential job 6 times in parallel, with each copy overwriting the files
generated by the others.
> I have a problem with a parallel run of Wien2k 13.1 on a cluster
> with the Slurm environment + Intel MPI.
> In a test run on 1 node with 6 CPU cores, I
> generated the following .machines file:
>
> -------.machines
> #
> lapw0:alcc69
> 1:alcc69:6
> granularity:1
> extrafine:1
This is ok, except for the lapw0 line, which as written would not run in
parallel. Use
lapw0:alcc69:6
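
For reference, a minimal sketch of the complete file with only the lapw0 line changed. It assumes the host name alcc69 and the 6 cores from the original post; everything else is copied from the quoted file.

```shell
# Write the corrected .machines file for one node (alcc69) with 6 cores.
# Only the lapw0 line differs from the file in the original post.
cat > .machines <<'EOF'
#
lapw0:alcc69:6
1:alcc69:6
granularity:1
extrafine:1
EOF
```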
> and used the following command in the script:
> srun -n 6 runsp_lapw -NI -cc 0.0001 -i 50
With srun -n 6 you are running "runsp_lapw ..." 6 times.
wien2k spawns its parallel processes itself (provided you have properly
installed wien2k and specified the proper "mpirun ..." command during
siteconfig), but you must add the -p flag.
So the single command
runsp_lapw -NI -cc 0.0001 -i 50 -p
should itself start 6 parallel processes (mpi-parallel, given your
.machines file).
(You only need to have permission to do so.)
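
As a rough sketch only, a Slurm batch script along these lines would start the driver once and leave the parallel startup to wien2k. The `write_machines` helper is hypothetical and not part of wien2k. The `#SBATCH` values and the use of `SLURMD_NODENAME` and `SLURM_NTASKS` are assumptions about a single-node, 6-task job, and the fallbacks exist only so the script can be dry-run outside Slurm.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=6

# Hypothetical helper: write a one-node, mpi-parallel .machines file.
# Arguments: host name, number of cores, output file (default: .machines).
write_machines() {
    local host="$1" ncores="$2" out="${3:-.machines}"
    {
        echo "#"
        echo "lapw0:${host}:${ncores}"
        echo "1:${host}:${ncores}"
        echo "granularity:1"
        echo "extrafine:1"
    } > "$out"
}

# Slurm sets these variables inside a batch job (assumption: one node);
# the fallbacks are only used when the script is tried outside Slurm.
host="${SLURMD_NODENAME:-$(hostname 2>/dev/null || echo localhost)}"
ncores="${SLURM_NTASKS:-6}"
write_machines "$host" "$ncores"

# Start the SCF driver ONCE, without srun; -p makes it read .machines.
if command -v runsp_lapw >/dev/null 2>&1; then
    runsp_lapw -NI -cc 0.0001 -i 50 -p
else
    echo "runsp_lapw not found in PATH; only .machines was written" >&2
fi
```

The point of the sketch is that there is no srun in front of runsp_lapw: the batch script itself runs as a single process, and the mpirun command configured in siteconfig launches the parallel processes.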
>
> In the first cycle, lapw0, lapw1 and lapw2 finish successfully,
> but after that lcore and mixer continue to run in parallel mode;
> they intermix with lapw0 from the second cycle and cause a crash,
> which can be seen from the output in case.dayfile:
>
> --------------------------------------------------------------
> cycle 1 (Fri Nov 22 15:32:51 CET 2013) (50/99 to go)
>
>> lapw0 (15:32:51)
>> lapw0 (15:32:51)
>> lapw0 (15:32:51)
>> lapw0 (15:32:51)
>> lapw0 (15:32:51)
>> lapw0 (15:32:51) 44.798u 0.244s 0:45.75 98.4% 0+0k 0+0io 0pf+0w
>> lapw1 -up (15:33:37)
>> lapw1 -up (15:33:38)
>> lapw1 -up (15:33:38)
>> lapw1 -up (15:33:39)
>> lapw1 -up (15:33:39)
>> lapw1 -up (15:33:39) _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
>> lapw1 -dn (16:12:48)
>> lapw1 -dn (16:13:25)
>> lapw1 -dn (16:13:29)
>> lapw1 -dn (16:13:30)
>> lapw1 -dn (16:13:42)
>> lapw1 -dn (16:13:47) _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
> _nb in dscgst.F 512 128
>> lapw2 -up (17:07:01)
>> lapw2 -up (17:07:57)
>> lapw2 -up (17:08:44)
>> lapw2 -up (17:08:52)
>> lapw2 -dn (17:09:00)
>> lapw2 -up (17:09:01)
>> lapw2 -up (17:09:02)
>> lapw2 -dn (17:09:52)
>> lapw2 -dn (17:10:40)
>> lapw2 -dn (17:10:56)
>> lapw2 -dn (17:11:03)
>> lapw2 -dn (17:11:13)
>> lcore -up (17:11:40) 0.124u 0.024s 0:00.33 42.4% 0+0k 0+0io 0pf+0w
>> lcore -dn (17:11:41) 0.120u 0.024s 0:00.30 46.6% 0+0k 0+0io 0pf+0w
>> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io 0pf+0w
> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
>
>> stop error
>> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io 0pf+0w
>> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io 0pf+0w
>> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io 0pf+0w
> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
>
>> stop error
>> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io 0pf+0w
>> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io 0pf+0w
>> mixer (17:12:16) 0.680u 0.132s 0:02.28 35.5% 0+0k 0+0io 0pf+0w
> :ENERGY convergence: 0 0 0
> :CHARGE convergence: 0 0.0001 0
>
> cycle 2 (Fri Nov 22 17:12:18 CET 2013) (49/98 to go)
>
>> lapw0 (17:12:18)
>> lcore -up (17:12:58) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
>
>> stop error
>> lcore -up (17:13:02) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
>
>> stop error
> ------------------------------------------------------------------------------
>
> It looks like the .machines file needs some additional details about
> the calculation mode for lcore and mixer. How do I configure
> .machines properly in this case?
>
>
> Best regards, N.Pavlenko
>
>
--
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------