[Wien] stubborn segmentation fault
Stefaan Cottenier
Stefaan.Cottenier at UGent.be
Thu Oct 25 13:46:31 CEST 2012
Dear wien2k community,
I have not managed to get wien2k running flawlessly on our university
cluster (Intel Xeon Harpertown (L5420)). For some cases, a reproducible
segmentation fault appears in lapw2. Our very capable sysadmins
gave up and blame it on 'a wien2k coding problem'. That is why I want to
describe the problem to you:
A) Description of the problem:
* It is a "forrtl: severe (174): SIGSEGV, segmentation fault occurred"
error, which appears in lapw2 with FOR in case.in2 (never with TOT). The
full screen output (compiled with ifort, including -g -traceback) for
k-point parallelization over 2 cores is:
LAPW2 - FERMI; weighs written
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
lapw2              0000000000484D28  l2main_                   893  l2main_tmp_.F
lapw2              00000000004A1C2D  MAIN__                    564  lapw2_tmp_.F
lapw2              0000000000403C4C  Unknown               Unknown  Unknown
libc.so.6          000000300081D994  Unknown               Unknown  Unknown
lapw2              0000000000403B59  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
lapw2              0000000000484D28  l2main_                   893  l2main_tmp_.F
lapw2              00000000004A1C2D  MAIN__                    564  lapw2_tmp_.F
lapw2              0000000000403C4C  Unknown               Unknown  Unknown
libc.so.6          000000300081D994  Unknown               Unknown  Unknown
lapw2              0000000000403B59  Unknown               Unknown  Unknown
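As a minimal sketch (not part of wien2k), the traceback above can be condensed with a few lines of shell to check that both k-point processes fail at the same frames. The function name summarise_frames is hypothetical; the sample input reproduces the frames reported above, and only the routine and line columns are used.

```shell
#!/bin/sh
# Hypothetical helper, not part of wien2k: condense ifort tracebacks.
# Counts how often each resolved frame (routine:line) occurs in the input.
summarise_frames() {
    # A frame line reads: image, PC (hex), routine, line, [source].
    # Frames without debug info carry "Unknown" as line and are skipped,
    # as are the header row and the "forrtl:" message lines.
    awk '$2 ~ /^[0-9A-Fa-f]+$/ && $4 ~ /^[0-9]+$/ { n[$3 ":" $4]++ }
         END { for (k in n) print n[k], k }' | sort -k2
}

# Sample input: the two tracebacks printed by the two parallel processes.
summary=$(summarise_frames <<'EOF'
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
lapw2 0000000000484D28 l2main_ 893 l2main_tmp_.F
lapw2 00000000004A1C2D MAIN__ 564 lapw2_tmp_.F
lapw2 0000000000403C4C Unknown Unknown Unknown
libc.so.6 000000300081D994 Unknown Unknown Unknown
lapw2 0000000000403B59 Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
lapw2 0000000000484D28 l2main_ 893 l2main_tmp_.F
lapw2 00000000004A1C2D MAIN__ 564 lapw2_tmp_.F
lapw2 0000000000403C4C Unknown Unknown Unknown
libc.so.6 000000300081D994 Unknown Unknown Unknown
lapw2 0000000000403B59 Unknown Unknown Unknown
EOF
)
printf '%s\n' "$summary"
```

For the output above, each resolved frame (l2main_ at line 893 and MAIN__ at line 564) is counted twice, i.e. both processes crash at the same place.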
* It appears only for a limited number of cases (say 20% of all the ones
I tried). The others run just fine.
* The problem appears only in parallel runs. If a case shows the
problem, one additional serial iteration is sufficient to complete the
scf-cycle.
* If the problem appears, it can be reproduced only with 'run_lapw -p'. If
one tries a manual 'parallel' execution as shown below (which I thought
should execute exactly the same processes), the error does not show up:
lapw0 lapw0.def
lapw1 lapw1.def [1]
lapw2 lapw2.def [1]
lapw1 lapw1.def [2]
lapw2 lapw2.def [2]
...
B) Detailed analysis
My first guess was the compiler version. Three different
ifort versions were tested (including the celebrated 2011.3.174 that was
reported on the wien2k mailing list to work fine for v12.1), but all
result in the same error:
v2011.1.073
v2011.3.174
v2011.10.319
Next, I looked for the cause by going through all the steps
described at the following link (a very useful resource for
this mailing list; I suggest mentioning it in the FAQ):
http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/
None of the steps described there gave any improvement, up to the first
half of "possible cause #5". The second test described in #5 does yield
something, however. When compiling with the additional options
-fp-stack-check -g -traceback -gen-interfaces -warn interfaces
the compilation of lapw2 fails with the following error:
c3fft_tmp_.F(267): error #6633: The type of the actual argument differs
from the type of the dummy argument. [WSAVE]
CALL CFFTB1 (N,C,WSAVE,WSAVE(IW1),WSAVE(IW2))
----------------------------------------^
compilation aborted for c3fft_tmp_.F (code 1)
A search of the wien2k mailing list for c3fft shows that there had
been problems with this routine before, and that an updated version had
been provided one year ago (i.e. before v12.1):
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2011-April/014541.html
That seems to have been a different problem, however, and both the present
version and that (slightly different) version of April 2011 give the
same compilation error.
Can anyone use this information to find a solution?
Thanks !
Stefaan