[Wien] stubborn segmentation fault

Thu Oct 25 16:59:10 CEST 2012

a) As mentioned before, the default FFT routines cannot be used with array-bounds checking.
Try activating -DFFT2 or 3 and linking the corresponding fftw2/3 libraries (the FFTW
libraries should be present anyway on a "University cluster").
Then you should again be able to use all the "debugging switches".

b) The manual calculation should not be what you indicated.
   It should be:
   x lapw0
   x lapw1 -p
   x lapw2 -p

or (assuming that you still have all def files from a previous run)
    lapw0 lapw0.def
    lapw1 lapw1_1.def
    lapw1 lapw1_2.def
    lapw2 lapw2.def 2      ! calculates EF and weight-files
    lapw2 lapw2_1.def 1
    lapw2 lapw2_2.def 2

If this runs ok but run_lapw -p does not (both in the same environment/node/batch job!),
then I do not understand anything. One exception: depending on your parallel setup
(non-shared memory), you should even execute the lapw1/2 lines as:
    ssh nodexx "cd $PWD;lapw2 lapw2_2.def 2"
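
The manual sequence above can be replayed with a small script. This is only a sketch: the function names and the DRY_RUN switch are made up for illustration, and it assumes the def files are named lapw1_<i>.def / lapw2_<i>.def and that the numeric lapw2 arguments follow the pattern of the listing above (number of parts for the Fermi step, part index afterwards).

```shell
#!/bin/sh
# Sketch: replay the k-point-parallel steps by hand, one def file at a time.
# Hypothetical helpers (not part of WIEN2k): run_step, count_parts, manual_parallel.

# Run a command, or only print it when DRY_RUN=1.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        echo ">>> $*" >&2
        "$@"
    fi
}

# Count consecutive lapw1_<i>.def files (i = 1, 2, ...) in the current directory.
count_parts() {
    n=0
    while [ -f "lapw1_$((n + 1)).def" ]; do
        n=$((n + 1))
    done
    echo "$n"
}

# Run lapw0, every lapw1 part, the Fermi step, then every lapw2 part.
# Stops at the first step that fails, so the failing part is easy to spot.
manual_parallel() {
    nparts=$(count_parts)
    [ "$nparts" -gt 0 ] || { echo "no lapw1_<i>.def files found" >&2; return 1; }

    run_step lapw0 lapw0.def || return 1
    i=1
    while [ "$i" -le "$nparts" ]; do
        run_step lapw1 "lapw1_$i.def" || return 1
        i=$((i + 1))
    done
    # First lapw2 call: EF and weight files, given the number of parts.
    run_step lapw2 lapw2.def "$nparts" || return 1
    i=1
    while [ "$i" -le "$nparts" ]; do
        run_step lapw2 "lapw2_$i.def" "$i" || return 1
        i=$((i + 1))
    done
}
```

Calling manual_parallel with DRY_RUN=1 in the case directory only prints the commands, so the sequence can be compared with the listing before anything is executed.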

c) Check the source code and put print statements for all variables in the vicinity of line 893
    of l2main.F. In my version this line is empty, so there "cannot be" a seg-fault there. This would
    indicate that there is some other (earlier) problem.
    Make sure maxindex is set outside the IF statement (I guess this was the earlier bug mentioned by Gavin):

...
               maxindex=(lmx+1)**2
            IF (force.AND.forcea(0,jatom)) THEN

!_REAL        lda=2*maxindex
!_COMPLEX     lda=maxindex
               ldb=ibpp_max*iblock
               ldc=lda

               do i_h_k=1,3
                  do index=1,maxindex
                     do i3=1,ibb
                        h_ablyl_hk(index,i3)=h_alyl(index,i3)*h_k(i_h_k,i3)
                     enddo
                  enddo
!                 aalm_buf_tmp=0.0d0
!_REAL           CALL dgemm('N','N',2*maxindex,nemax-nemin+1,ibb,1.d0, &
!_REAL                   h_ablyl_hk,lda,a(ii,nemin),ldb,1.d0,aalm_buf_tmp(1,nemin,i_h_k),ldc)
!_COMPLEX        CALL zgemm('N','N',maxindex,nemax-nemin+1,ibb,(1.d0,0.d0), &
!_COMPLEX                h_ablyl_hk,lda,a(ii,nemin),ldb,(1.d0,0.d0),aalm_buf_tmp(1,nemin,i_h_k),ldc)
!                 do num=nemin,nemax
!                    do index=1,maxindex
!                       aalm_buf(i_h_k,num,index)=aalm_buf(i_h_k,num,index)+aalm_buf_tmp(index,num)
!                    enddo
!                 enddo
               enddo
                                              --> this is my empty line 893
               do i_h_k=1,3
....

  Put statements like print*, 'line xx',maxindex,ibpp_max,iblock,nemin plus the loop indices (i_h_k, ...)
at several places in this IF block (at the beginning, before and after dgemm/zgemm)
to find out where exactly it crashes, i.e. what it can still print and what not.
(By the way: is this a "real" or a "complex" case?)
If it is a "compiler bug", it may even run after inserting the print statements.

Another possible reason: a file-system problem? FOR needs the case.nsh_1/2 files. Are these files ok?
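
A quick first check of these files could look like the sketch below. It assumes that "ok" at least means "exists and is not empty", which is only a necessary condition; check_nsh is a made-up helper name, and its first argument stands for whatever "case" is in your file names.

```shell
#!/bin/sh
# Sketch: check_nsh CASE N reports whether CASE.nsh_1 ... CASE.nsh_N
# exist and are non-empty. Hypothetical helper, not part of WIEN2k.
check_nsh() {
    case_name=$1
    nparts=$2
    bad=0
    i=1
    while [ "$i" -le "$nparts" ]; do
        f="${case_name}.nsh_$i"
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f"
            bad=1
        fi
        i=$((i + 1))
    done
    # Non-zero exit status if any file was missing or empty.
    return "$bad"
}
```

For two parallel parts this would be called as check_nsh case 2 in the case directory, right after the lapw2 Fermi step and before the individual lapw2 parts start.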

Best regards
Peter

On 25.10.2012 13:46, Stefaan Cottenier wrote:
>
> Dear wien2k community,
>
> I have not succeeded in getting wien2k to run flawlessly on our university cluster (Intel Xeon Harpertown (L5420)). For some cases, a reproducible segmentation fault appears in
> lapw2. Our very capable sysadmins gave up, and blame it on 'a wien2k coding problem'. That's why I want to describe the problem to you:
>
> A) Description of the problem:
>
> * It is a "forrtl: severe (174): SIGSEGV, segmentation fault occurred" error, which appears in lapw2 with FOR in case.in2 (never with TOT). The full screen output (compiled with
> ifort, including -g -traceback) for k-point parallelization over 2 cores is:
>
> LAPW2 - FERMI; weighs written
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image              PC                Routine            Line        Source
> lapw2              0000000000484D28  l2main_                   893 l2main_tmp_.F
> lapw2              00000000004A1C2D  MAIN__                    564 lapw2_tmp_.F
> lapw2              0000000000403C4C  Unknown               Unknown  Unknown
> libc.so.6          000000300081D994  Unknown               Unknown  Unknown
> lapw2              0000000000403B59  Unknown               Unknown  Unknown
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image              PC                Routine            Line        Source
> lapw2              0000000000484D28  l2main_                   893 l2main_tmp_.F
> lapw2              00000000004A1C2D  MAIN__                    564 lapw2_tmp_.F
> lapw2              0000000000403C4C  Unknown               Unknown  Unknown
> libc.so.6          000000300081D994  Unknown               Unknown  Unknown
> lapw2              0000000000403B59  Unknown               Unknown  Unknown
>
> * It appears only for a limited number of cases (say 20% of all the ones I tried). The others run just fine.
>
> * The problem appears only in parallel runs. If a case shows the problem, one additional serial iteration is sufficient to complete the scf-cycle.
>
> * If the problem appears, it can be reproduced only by 'run_lapw -p'. If one tries a manual 'parallel' execution as shown below (which I thought should execute exactly the same
> processes), the error does not show up:
>
> lapw0 lapw0.def
> lapw1 lapw1.def [1]
> lapw2 lapw2.def [1]
> lapw1 lapw1.def [2]
> lapw2 lapw2.def [2]
> ...
>
>
> B) Detailed analysis
>
> Trying different compiler versions was the first guess. Three different ifort versions were tested (including the celebrated 2011.3.174 that was reported on the wien2k mailing
> list to work fine for v12.1), but all result in the same error:
>
> v2011.1.073
> v2011.3.174
> v2011.10.319
>
> Next, I searched for the possible cause by going through all steps described at the following link (a very useful piece of information for this mailing list; I suggest mentioning
> it in the FAQ):
>
> http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/
>
> All steps described there lead to no improvement up to the first half of "possible cause #5". The second test described in #5 yields something, however. When compiling with the
> additional options
>
> -fp-stack-check -g -traceback -gen-interfaces -warn interfaces
>
> there is the following compile crash for lapw2 :
>
> c3fft_tmp_.F(267): error #6633: The type of the actual argument differs from the type of the dummy argument.   [WSAVE]
>        CALL CFFTB1 (N,C,WSAVE,WSAVE(IW1),WSAVE(IW2))
> ----------------------------------------^
> compilation aborted for c3fft_tmp_.F (code 1)
>
> When searching the wien2k mailing list for c3fft, it turns out there had been problems before with this routine, and an updated version had been provided one year ago (=before
> v12.1):
>
> http://zeus.theochem.tuwien.ac.at/pipermail/wien/2011-April/014541.html
>
> It seems to have been a different problem, however, and both the present version and that (slightly different) version of April 2011 give the same compilation error.
>
> Can anyone use this information to find a solution?
>
> Thanks !
>
> Stefaan
>

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------