[Wien] stubborn segmentation fault
Peter Blaha
pblaha at theochem.tuwien.ac.at
Thu Oct 25 16:59:10 CEST 2012
a) As mentioned before, the default FFT routines cannot be used with array checking.
Try activating -DFFT2 or 3 and linking the corresponding fftw2/3 libraries (the FFTW
libraries should be present anyway on a "University cluster").
Then you should again be able to use all possible "debugging switches".
b) The manual calculation should not be done the way you indicated.
It should be:
x lapw0
x lapw1 -p
x lapw2 -p
or (assuming that you still have all def files from a previous run)
lapw0 lapw0.def
lapw1 lapw1_1.def
lapw1 lapw1_2.def
lapw2 lapw2.def 2 ! calculates EF and weight-files
lapw2 lapw2_1.def 1
lapw2 lapw2_2.def 2
If this runs OK but run_lapw -p does not (both in the same environment/node/batch job!),
then I do not understand anything. One exception: depending on your parallel setup
(non-shared memory), you should even execute the lapw1/lapw2 lines as:
ssh nodexx "cd $PWD;lapw2 lapw2_2.def 2"
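
The sequence in b) can be collected in a small dry-run helper. This is only a sketch under assumptions: it assumes exactly two k-point parts (_1 and _2) and that the def files from a previous run still exist; the function names wrap_cmd and manual_sequence and the host names are hypothetical. It only prints the commands, so they can be inspected before being run by hand.

```shell
#!/bin/sh
# Dry-run sketch: print the manual k-point-parallel command sequence.
# Assumptions (not from the original mail): two k-point parts,
# hypothetical helper names, placeholder host names.

# wrap_cmd HOST CMD: print CMD, wrapped in ssh when HOST is not empty.
wrap_cmd() {
    if [ -n "$1" ]; then
        printf 'ssh %s "cd %s;%s"\n' "$1" "$PWD" "$2"
    else
        printf '%s\n' "$2"
    fi
}

# manual_sequence [HOST1 [HOST2]]: print all steps in order.
# Without hosts, every command is printed for local execution.
manual_sequence() {
    wrap_cmd "" "lapw0 lapw0.def"
    wrap_cmd "${1:-}" "lapw1 lapw1_1.def"
    wrap_cmd "${2:-}" "lapw1 lapw1_2.def"
    wrap_cmd "" "lapw2 lapw2.def 2"        # EF and weight files
    wrap_cmd "${1:-}" "lapw2 lapw2_1.def 1"
    wrap_cmd "${2:-}" "lapw2 lapw2_2.def 2"
}
```

Calling manual_sequence without arguments prints the six local commands; calling it with two host names (for example the nodes assigned by the batch system) prints the lapw1/lapw2 lines in the ssh form shown above.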
c) Check the source code and put print statements for all variables in the vicinity of line 893
of l2main.F. (In my version this line is empty, so there "cannot be" a seg-fault there.) This would
indicate that there is some other (earlier) problem.
Make sure that maxindex is set outside the IF statement (I guess this was the earlier bug mentioned by Gavin):
...
      maxindex=(lmx+1)**2
      IF (force.AND.forcea(0,jatom)) THEN
!_REAL      lda=2*maxindex
!_COMPLEX   lda=maxindex
         ldb=ibpp_max*iblock
         ldc=lda
         do i_h_k=1,3
            do index=1,maxindex
               do i3=1,ibb
                  h_ablyl_hk(index,i3)=h_alyl(index,i3)*h_k(i_h_k,i3)
               enddo
            enddo
!           aalm_buf_tmp=0.0d0
!_REAL      CALL dgemm('N','N',2*maxindex,nemax-nemin+1,ibb,1.d0, &
!_REAL           h_ablyl_hk,lda,a(ii,nemin),ldb,1.d0,aalm_buf_tmp(1,nemin,i_h_k),ldc)
!_COMPLEX   CALL zgemm('N','N',maxindex,nemax-nemin+1,ibb,(1.d0,0.d0), &
!_COMPLEX        h_ablyl_hk,lda,a(ii,nemin),ldb,(1.d0,0.d0),aalm_buf_tmp(1,nemin,i_h_k),ldc)
!           do num=nemin,nemax
!              do index=1,maxindex
!                 aalm_buf(i_h_k,num,index)=aalm_buf(i_h_k,num,index)+aalm_buf_tmp(index,num)
!              enddo
!           enddo
         enddo
--> this is my empty line 893
do i_h_k=1,3
....
Put statements such as
      print*, 'line xx',maxindex,ibpp_max,iblock,nemin
(plus the loop indices, e.g. i_h_k, ...) at several places in this IF block (at the beginning
and before/after the dgemm/zgemm call) to find out exactly where it crashes, i.e. what it can
still print and what not. (By the way: is this a "real" or a "complex" case?)
If it is a "compiler bug", it may even run after the print statements have been inserted.
Another possible reason: a file-system problem? FOR needs the case.nsh_1/2 files. Are these files OK?
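
A quick way to test the file-system question is to check that the files exist and are not empty. The following is a minimal sketch; the function name check_nonempty is hypothetical, and it only tests for existence and non-zero size, not whether the content is valid.

```shell
#!/bin/sh
# check_nonempty FILE...: report each file as ok, EMPTY or MISSING.
# Returns non-zero if any file is missing or empty.
# Hypothetical helper; it does not validate the file contents.
check_nonempty() {
    bad=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "MISSING $f"
            bad=1
        elif [ ! -s "$f" ]; then
            echo "EMPTY $f"
            bad=1
        else
            echo "ok $f"
        fi
    done
    return $bad
}
```

It would be called in the case directory, on the same node that runs lapw2, with the two file names (case replaced by the actual case name), e.g. check_nonempty case.nsh_1 case.nsh_2.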
Best regards
Peter
On 25.10.2012 13:46, Stefaan Cottenier wrote:
>
> Dear wien2k community,
>
> I have not succeeded in getting wien2k to run flawlessly on our university cluster (Intel Xeon Harpertown (L5420)). For some cases, a reproducible segmentation fault error appears in
> lapw2. Our very capable sysadmins gave up, and blame it on 'a wien2k coding problem'. That's why I want to describe the problem for you:
>
> A) Description of the problem:
>
> * It is a "forrtl: severe (174): SIGSEGV, segmentation fault occurred" error, which appears in lapw2 with FOR in case.in2 (never with TOT). The full screen output (compiled with
> ifort, including -g -traceback) for k-point parallelization over 2 cores is:
>
> LAPW2 - FERMI; weighs written
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> lapw2 0000000000484D28 l2main_ 893 l2main_tmp_.F
> lapw2 00000000004A1C2D MAIN__ 564 lapw2_tmp_.F
> lapw2 0000000000403C4C Unknown Unknown Unknown
> libc.so.6 000000300081D994 Unknown Unknown Unknown
> lapw2 0000000000403B59 Unknown Unknown Unknown
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> lapw2 0000000000484D28 l2main_ 893 l2main_tmp_.F
> lapw2 00000000004A1C2D MAIN__ 564 lapw2_tmp_.F
> lapw2 0000000000403C4C Unknown Unknown Unknown
> libc.so.6 000000300081D994 Unknown Unknown Unknown
> lapw2 0000000000403B59 Unknown Unknown Unknown
>
> * It appears only for a limited number of cases (say 20% of all the ones I tried). The others run just fine.
>
> * The problem appears only in parallel runs. If a case shows the problem, one additional serial iteration is sufficient to complete the scf-cycle.
>
> * If the problem appears, it can be reproduced only by 'run_lapw -p'. If one tries a manual 'parallel' execution as shown below (which I thought should execute exactly the same
> processes), the error does not show up:
>
> lapw0 lapw0.def
> lapw1 lapw1.def [1]
> lapw2 lapw2.def [1]
> lapw1 lapw1.def [2]
> lapw2 lapw2.def [2]
> ...
>
>
> B) Detailed analysis
>
> Trying different compiler versions was the first guess. Three different ifort versions were tested (including the celebrated 2011.3.174 that was reported on the wien2k mailing
> list to work fine for v12.1), but all result in the same error:
>
> v2011.1.073
> v2011.3.174
> v2011.10.319
>
> Next, I searched for the possible reason by going through all steps described at the following link (a very useful piece of information for this mailing list, I suggest mentioning
> it in the FAQ):
>
> http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/
>
> All steps described there lead to no improvement up to the first half of "possible cause #5". The second test described in #5 yields something, however. When compiling with the
> additional options
>
> -fp-stack-check -g -traceback -gen-interfaces -warn interfaces
>
> there is the following compile crash for lapw2 :
>
> c3fft_tmp_.F(267): error #6633: The type of the actual argument differs from the type of the dummy argument. [WSAVE]
> CALL CFFTB1 (N,C,WSAVE,WSAVE(IW1),WSAVE(IW2))
> ----------------------------------------^
> compilation aborted for c3fft_tmp_.F (code 1)
>
> When searching the wien2k mailing list for c3fft, it turns out there had been problems before with this routine, and an updated version had been provided one year ago (=before
> v12.1):
>
> http://zeus.theochem.tuwien.ac.at/pipermail/wien/2011-April/014541.html
>
> It seems to have been a different problem, however, and both the present version and that (slightly different) version of April 2011 give the same compilation error.
>
> Can anyone use this information to find a solution?
>
> Thanks !
>
> Stefaan
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------