[Wien] k point parallel calculations
Priyanka Seth
priyanka.seth at polytechnique.edu
Tue Feb 24 19:19:14 CET 2015
Hello all,
I have been trying to run k-point parallel calculations for some
large structures and have run into problems with versions 12, 13 and
14, all compiled with ifort. In all cases, the number of cores equals
the number of k-points. Note that calculations started from the same
input run without any problems on a single core.
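For reference, a k-point parallel run of this kind is driven by a .machines file. The following is only a minimal sketch of generating one, assuming the usual layout of one `1:host` line per k-point job followed by `granularity` and `extrafine` settings. The helper name `machines_file` and the host name `localhost` are illustrative assumptions, not the actual setup used here.

```python
def machines_file(hosts, granularity=1, extrafine=True):
    """Return .machines text with one k-point parallel job per host entry."""
    # One "1:host" line per job: weight 1, one core on that host.
    lines = [f"1:{host}" for host in hosts]
    lines.append(f"granularity:{granularity}")
    if extrafine:
        lines.append("extrafine:1")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example: 24 jobs on a single node, one per k-point.
    print(machines_file(["localhost"] * 24), end="")
```

Redirecting the output to `.machines` in the case directory would give 24 parallel jobs, matching the `24 number_of_parallel_jobs` line in the dayfile.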
v12/v13
=====
This is the output for versions 12 and 13 (I've removed the
node-dependent lines):
LAPW0 END
LAPW1 END
LAPW2 - FERMI; weighs written
LAPW2 END
SUMPARA END
CORE END
forrtl: severe (59): list-directed I/O syntax error, unit -5, file Internal List-Directed Read
Image      PC                Routine  Line     Source
mixer      000000000051693D  Unknown  Unknown  Unknown
mixer      0000000000515445  Unknown  Unknown  Unknown
mixer      00000000004BC9E0  Unknown  Unknown  Unknown
mixer      000000000046F4BA  Unknown  Unknown  Unknown
mixer      000000000046ECB0  Unknown  Unknown  Unknown
mixer      0000000000492B76  Unknown  Unknown  Unknown
mixer      000000000049043B  Unknown  Unknown  Unknown
mixer      0000000000407E7E  MAIN__   168      mixer.F
mixer      000000000040414C  Unknown  Unknown  Unknown
libc.so.6  00000037C241D994  Unknown  Unknown  Unknown
mixer      0000000000403FC9  Unknown  Unknown  Unknown
> stop error
Looking at the error files, I have "Error in MIXER" in both versions.
The dayfile ends as follows:
1.884u 0.844s 0:09.73 27.9% 0+0k 0+0io 8pf+0w
> lcore (09:33:51) 0.046u 0.007s 0:00.14 28.5% 0+0k 0+0io 7pf+0w
> mixer (09:33:51) 0.000u 0.005s 0:00.04 0.0% 0+0k 0+0io 8pf+0w
error: command /home/pseth/SOURCES/WIEN2K_v13/mixer mixer.def failed
> stop error
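The error-file check described above can be sketched as a small script. This is only a rough sketch, assuming that a non-empty `*.error` file in the case directory marks the step that failed. The function name `nonempty_error_files` is made up for illustration and is not part of WIEN2k.

```python
from pathlib import Path


def nonempty_error_files(case_dir="."):
    """Return {filename: contents} for every non-empty *.error file."""
    found = {}
    for path in sorted(Path(case_dir).glob("*.error")):
        # errors="replace" avoids a crash on unexpected bytes in the file.
        text = path.read_text(errors="replace").strip()
        if text:
            found[path.name] = text
    return found


if __name__ == "__main__":
    # Run from inside the case directory.
    for name, text in nonempty_error_files().items():
        print(f"{name}: {text}")
```

Run in the case directory after a crash, this lists each failed step together with the message left in its error file, which also covers the per-job `lapw1_*.error` files of a parallel run.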
v14
===
I get to the second cycle, but then the calculation crashes with "Error
in LAPW1" in lapw1_*.error:
LAPW2 END
SUMPARA END
CORE END
MIXER END
ec cc and fc_conv 0 0 1
in cycle 2 ETEST: 0 CTEST: 0
LAPW0 END
I see nothing obviously wrong in the case.scf1_* files or in the
dayfile, which ends like this:
> lapw1 -p (09:37:40) starting parallel lapw1 at Tue Feb 10 09:37:40 CET 2015
-> starting parallel LAPW1 jobs at Tue Feb 10 09:37:40 CET 2015
running LAPW1 in parallel mode (using .machines)
24 number_of_parallel_jobs
[1] 30405
[2] 30437
[3] 30471
[4] 30507
[5] 30559
[6] 30606
[7] 30653
[8] 30717
[9] 30809
[10] 30916
[11] 31000
[12] 31070
[13] 31192
[14] 31329
[15] 31428
[16] 31504
[17] 31664
[18] 31788
[19] 31871
[20] 31900
[21] 31928
[22] 31956
[23] 31982
[24] 32010
[5] Done ( ( $remote $machine[$p] ...
I understand that this is not much information to go on, but I don't
really know where else to look! Has anyone had similar issues? What else
would help in diagnosing the problem?
Many thanks,
Priyanka