[Wien] Problem with wien2k 13.1 parallel for Slurm+intel mpi
Natalia Pavlenko
natalia.pavlenko at physik.uni-augsburg.de
Sat Nov 23 12:54:07 CET 2013
Dear users,
I have a problem running WIEN2k 13.1 in parallel on a cluster
with the Slurm environment and Intel MPI.
For a test run on 1 node with 6 CPU cores, I
generated the following .machines file:
-------.machines
#
lapw0:alcc69
1:alcc69:6
granularity:1
extrafine:1
---------------------------------
and used the following command in the script:
srun -n 6 runsp_lapw -NI -cc 0.0001 -i 50
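
For reference, the .machines file above can also be written from inside the job script rather than by hand. The following is only a sketch: make_machines is a hypothetical helper (not part of WIEN2k), and it simply reproduces the layout shown above for one node.

```shell
#!/bin/sh
# Sketch only: print a .machines file with the same layout as the one above.
# make_machines is a hypothetical helper, not part of WIEN2k.
make_machines() {
    host=$1      # node name, e.g. alcc69
    ncores=$2    # number of cores used by the MPI-parallel job on that node
    printf '#\n'
    printf 'lapw0:%s\n' "$host"
    printf '1:%s:%s\n' "$host" "$ncores"
    printf 'granularity:1\n'
    printf 'extrafine:1\n'
}

# Possible use inside a Slurm job (SLURM_NTASKS is set by Slurm when a task
# count is requested; 6 is only a fallback, and the short hostname is assumed
# to match the node name used by the cluster):
# make_machines "$(hostname -s)" "${SLURM_NTASKS:-6}" > .machines
```

Calling make_machines alcc69 6 prints exactly the five lines of the file listed above.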
In the first cycle, lapw0, lapw1 and lapw2 finish successfully,
but after that lcore and mixer continue to run in parallel mode.
They intermix with lapw0 from the second cycle and cause a crash,
as can be seen from the output in case.dayfile:
--------------------------------------------------------------
cycle 1 (Fri Nov 22 15:32:51 CET 2013) (50/99 to go)
> lapw0 (15:32:51)
> lapw0 (15:32:51)
> lapw0 (15:32:51)
> lapw0 (15:32:51)
> lapw0 (15:32:51)
> lapw0 (15:32:51) 44.798u 0.244s 0:45.75 98.4% 0+0k 0+0io 0pf+0w
> lapw1 -up (15:33:37)
> lapw1 -up (15:33:38)
> lapw1 -up (15:33:38)
> lapw1 -up (15:33:39)
> lapw1 -up (15:33:39)
> lapw1 -up (15:33:39) _nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
> lapw1 -dn (16:12:48)
> lapw1 -dn (16:13:25)
> lapw1 -dn (16:13:29)
> lapw1 -dn (16:13:30)
> lapw1 -dn (16:13:42)
> lapw1 -dn (16:13:47) _nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
> lapw2 -up (17:07:01)
> lapw2 -up (17:07:57)
> lapw2 -up (17:08:44)
> lapw2 -up (17:08:52)
> lapw2 -dn (17:09:00)
> lapw2 -up (17:09:01)
> lapw2 -up (17:09:02)
> lapw2 -dn (17:09:52)
> lapw2 -dn (17:10:40)
> lapw2 -dn (17:10:56)
> lapw2 -dn (17:11:03)
> lapw2 -dn (17:11:13)
> lcore -up (17:11:40) 0.124u 0.024s 0:00.33 42.4% 0+0k 0+0io 0pf+0w
> lcore -dn (17:11:41) 0.120u 0.024s 0:00.30 46.6% 0+0k 0+0io 0pf+0w
> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io 0pf+0w
error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
> stop error
> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io 0pf+0w
> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io 0pf+0w
> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io 0pf+0w
error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
> stop error
> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io 0pf+0w
> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io 0pf+0w
> mixer (17:12:16) 0.680u 0.132s 0:02.28 35.5% 0+0k 0+0io 0pf+0w
:ENERGY convergence: 0 0 0
:CHARGE convergence: 0 0.0001 0
cycle 2 (Fri Nov 22 17:12:18 CET 2013) (49/98 to go)
> lapw0 (17:12:18)
> lcore -up (17:12:58) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
> stop error
> lcore -up (17:13:02) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
> stop error
------------------------------------------------------------------------------
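
To make the interleaving in the dayfile easier to see, one can count how often the start of a given program is logged. This is a rough sketch that assumes the dayfile format shown above ("> program ... (time)"). count_starts is a hypothetical helper, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# count_starts is a hypothetical helper: it reads dayfile-style text on stdin
# and prints how many times the start of the given program is logged.
count_starts() {
    prog=$1
    # grep -o prints every match on its own line, so wc -l counts matches
    # even when several starts ended up on one physical line.
    grep -o ">[[:space:]]*${prog}[^(]*(" | wc -l | tr -d ' '
}

# Possible use, restricted to the first cycle of the dayfile:
# sed -n '/cycle 1 /,/cycle 2 /p' case.dayfile | count_starts lapw0
```

In the first cycle shown above, six lapw0 starts are logged, which is the same number as the tasks requested with srun -n 6.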
It looks like the .machines file needs some additional details about
the calculation mode for lcore and mixer. How should the .machines file
be configured in this case?
Best regards, N.Pavlenko
--
Dr. Natalia Pavlenko
Institute of Physics, University of Augsburg
Universitätstr.1, 86135 Augsburg
Tel.: 0821-5983664
Fax: 0821-5983652