[Wien] segmentation fault in lapwso

Laurence Marks laurence.marks at gmail.com
Thu Aug 19 13:56:28 CEST 2021


A suggestion: check your mkl version, as there is an MKL bug that was
recently fixed; see
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Problem-with-LAPACK-subroutine-ZHEEVR-input-array-quot-isuppz/td-p/1150816
_____
Professor Laurence Marks
"Research is to see what everybody else has seen, and to think what nobody
else has thought", Albert Szent-Györgyi
www.numis.northwestern.edu

On Thu, Aug 19, 2021, 06:45 Peter Blaha <pblaha at theochem.tuwien.ac.at>
wrote:

> I'm still on vacation, so I cannot test myself.
>
> However, I have experienced such problems before. It has to do with
> multithreading (1 thread always works fine) and the mkl routine zheevr.
>
> In my case I could fix the problem by enlarging the workspace beyond
> what the routine calculates itself (see the comment in hmsec on line 841).
>
> Right below that, the workspace was enlarged by a factor of 10, which
> fixed my problem. But I can easily envision that it might not be enough
> in some other cases.
>
> An alternative is to switch back to zheevx (commented in the code).
>
> Peter Blaha
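
For readers who want to try the workaround described above, here is a
minimal, self-contained Fortran sketch of the idea: do the usual zheevr
workspace query, then enlarge the reported sizes (here by a factor of 10)
before the real call. The program, the variable names, and the test matrix
are illustrative assumptions only and do not reproduce the actual hmsec
code; the real edit is the one pointed to around line 841.

  program zheevr_workspace_sketch
    implicit none
    integer, parameter :: n = 4
    complex*16 :: h(n,n), z(n,n)
    real*8     :: w(n), abstol
    integer    :: isuppz(2*n), m, info, lwork, lrwork, liwork, i, j
    complex*16, allocatable :: work(:)
    real*8,     allocatable :: rwork(:)
    integer,    allocatable :: iwork(:)

    ! small Hermitian test matrix (illustrative)
    do j = 1, n
       do i = 1, n
          h(i,j) = dcmplx(min(i,j), 0)
       end do
    end do
    abstol = 0.d0

    ! workspace query: lwork = -1 makes zheevr report the sizes it wants
    allocate(work(1), rwork(1), iwork(1))
    call zheevr('V', 'A', 'U', n, h, n, 0.d0, 0.d0, 0, 0, abstol, m, w, &
                z, n, isuppz, work, -1, rwork, -1, iwork, -1, info)

    ! the workaround: enlarge the reported sizes by a factor of 10
    lwork  = 10*int(work(1))
    lrwork = 10*int(rwork(1))
    liwork = 10*iwork(1)
    deallocate(work, rwork, iwork)
    allocate(work(lwork), rwork(lrwork), iwork(liwork))

    call zheevr('V', 'A', 'U', n, h, n, 0.d0, 0.d0, 0, 0, abstol, m, w, &
                z, n, isuppz, work, lwork, rwork, lrwork, iwork, liwork, &
                info)
    write(*,*) 'info =', info, ' eigenvalues:', w
  end program zheevr_workspace_sketch

The query-then-over-allocate pattern is what the factor-of-10 enlargement
in hmsec amounts to, per the description above.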
>
> Am 18.08.2021 um 20:01 schrieb Pavel Ondračka:
> > Right, I think it is quite clear that the deallocate is failing because
> > the memory was corrupted at some earlier point; the only other reason it
> > could crash would be that the array was never allocated at all, which
> > does not seem to be the case here. The question is what corrupted the
> > memory, and, even stranger, why it works when we disable MKL
> > multithreading.
> >
> > It could indeed be that we are doing something wrong. I can imagine the
> > memory being corrupted in some BLAS call: if the number of columns/rows
> > passed to a specific BLAS call is larger than the actual size of the
> > matrix, this could easily happen (and the multithreading somehow
> > influences the final value of the corrupted memory, and depending on
> > that value the deallocate either fails or happens to pass). This should
> > be possible to diagnose with valgrind as suggested.
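
To make that hypothesis concrete, here is a minimal, self-contained Fortran
sketch (illustrative only, not actual Wien2k code) of how a BLAS call that
is handed the wrong dimensions tramples the heap and makes a later
deallocate misbehave:

  program blas_overrun_sketch
    implicit none
    complex*16, allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: n, m, ierr
    n = 100                  ! actual size of c
    m = 110                  ! size mistakenly passed to the BLAS call
    allocate(a(m,m), b(m,m), c(n,n))
    a = (1.d0,0.d0); b = (1.d0,0.d0); c = (0.d0,0.d0)
    ! zgemm is told that c is m x m, so it writes m*m elements into an
    ! n x n array and overwrites whatever the allocator keeps behind it
    call zgemm('N', 'N', m, m, m, (1.d0,0.d0), a, m, b, m, &
               (0.d0,0.d0), c, m)
    deallocate(c, stat=ierr) ! may fail or crash, depending on what the
                             ! (possibly threaded) BLAS wrote there
    write(*,*) 'deallocate stat =', ierr
  end program blas_overrun_sketch

Whether the deallocate fails, crashes, or silently succeeds depends on what
exactly was overwritten, which is why the outcome can change with the number
of threads; valgrind reports such out-of-bounds heap writes as "Invalid
write" errors.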
> >
> > Luis, can you upload the testcase somewhere, or recompile with
> > debuginfo as suggested by Laurence earlier, run
> > "valgrind --track-origins=yes lapwso lapwso.def" and send the output?
> > Just be warned, there is a massive slowdown with valgrind (up to 100x)
> > and the logfile can get very large.
> >
> > Best regards
> > Pavel
> >
> >
> > On Wed, 2021-08-18 at 12:10 -0500, Laurence Marks wrote:
> >> Correction, I was looking at an older modules.F. It looks like it
> >> should be
> >>
> >> DEALLOCATE(vect,stat=IV) ; if(IV .ne. 0)write(*,*)IV
> >>
> >>
> >> On Wed, Aug 18, 2021 at 11:23 AM Laurence Marks
> >> <laurence.marks at gmail.com> wrote:
> >>> I do wonder about this. I suggest editing module.F and changing
> >>> lines 118 and 119 to
> >>>       DEALLOCATE(en,stat=Ien) ; if(Ien .ne. 0)write(*,*)'Err en ',Ien
> >>>       DEALLOCATE(vnorm,stat=Ivn) ; if(Ivn .ne. 0)write(*,*)'Err vnorm ',Ivn
> >>>
> >>> There is every chance that the bug is not in those lines, but
> >>> somewhere completely different. SIGSEGV often means that memory
> >>> has been overwritten, for instance by arrays going out of bounds.
> >>>
> >>> You can also recompile with -g added (don't change other options),
> >>> and/or -C. Sometimes this is better. Or use other tools such as
> >>> debuggers or valgrind.
> >>>
> >>> On Wed, Aug 18, 2021 at 10:47 AM Pavel Ondračka
> >>> <pavel.ondracka at email.cz> wrote:
> >>>> I'm CCing the list back, as the crash has now been diagnosed as a
> >>>> likely MKL problem; see below for more details.
> >>>>>
> >>>>>> So just to be clear, explicitly setting OMP_STACKSIZE=1g does not
> >>>>>> help to solve the issue?
> >>>>>>
> >>>>>
> >>>>> Right! OMP_STACKSIZE=1g with OMP_NUM_THREADS=4 does not solve the
> >>>>> problem!
> >>>>>
> >>>>>> The problem is that the OpenMP code in lapwso is very simple, so
> >>>>>> I'm having trouble seeing how it could be causing the problems.
> >>>>>>
> >>>>>> Could you also try to see what happens if run with:
> >>>>>> OMP_NUM_THREADS=1
> >>>>>> MKL_NUM_THREADS=4
> >>>>>>
> >>>>>
> >>>>> It does not work with these values, but I checked and it works
> >>>>> with them reversed:
> >>>>> OMP_NUM_THREADS=4
> >>>>> MKL_NUM_THREADS=1
> >>>> This was very helpful and IMO points to a problem with MKL instead
> >>>> of Wien2k.
> >>>>
> >>>> Unfortunately, setting MKL_NUM_THREADS=1 globally will reduce the
> >>>> OpenMP performance, mostly in lapw1 but also at other places. So if
> >>>> you want to keep the OpenMP BLAS/LAPACK-level parallelism, you have
> >>>> to either find some MKL version that works (if you do, please report
> >>>> it here), link with OpenBLAS (using it just for lapwso is enough),
> >>>> or create a simple wrapper that sets MKL_NUM_THREADS=1 just for
> >>>> lapwso, i.e., rename the lapwso binary in WIENROOT to lapwso_bin and
> >>>> create a new lapwso file there with:
> >>>>
> >>>> #!/bin/bash
> >>>> MKL_NUM_THREADS=1 lapwso_bin "$1"
> >>>>
> >>>> and make it executable with chmod +x lapwso.
> >>>>
> >>>> Or maybe MKL has a non-OpenMP version which you could link with just
> >>>> lapwso while using the standard one in other parts, but I don't know;
> >>>> I mostly use OpenBLAS. If you need some further help, let me know.
> >>>>
> >>>> Reporting the issue to Intel would also be nice; however, I have
> >>>> never had any real luck there, and it is also a bit problematic as
> >>>> you can't provide a testcase due to Wien2k being proprietary code...
> >>>>
> >>>> Best regards
> >>>> Pavel
> >>>>
> >>>>>
> >>>>>> This should disable the Wien2k-specific OpenMP parallelism but
> >>>>>> still keep the rest of the parallelism at the BLAS/LAPACK level.
> >>>>>>
> >>>>>
> >>>>> So, perhaps, the problem is related to MKL!
> >>>>>
> >>>>>> Another option is that something is going wrong before lapwso and
> >>>>>> the lapwso crash is just the symptom. What happens if you run
> >>>>>> everything up to lapwso without OpenMP (OMP_NUM_THREADS=1) and then
> >>>>>> enable it just for lapwso?
> >>>>>>
> >>>>>
> >>>>> If I run lapw0 and lapw1 with OMP_NUM_THREADS=4 and then change it
> >>>>> to 1 just before lapwso, it works.
> >>>>> If I do the opposite, starting with OMP_NUM_THREADS=1 and then
> >>>>> changing it to 4 just before lapwso, it does not work.
> >>>>> So I believe that the problem is really in lapwso.
> >>>>>
> >>>>>     If you need more information, please, let me know!
> >>>>>     All the best,
> >>>>>              Luis
> >>>>
> >>>
> >>> --
> >>> Professor Laurence Marks
> >>> Department of Materials Science and Engineering
> >>> Northwestern University
> >>> http://www.numis.northwestern.edu
> >>> "Research is to see what everybody else has seen, and to think what
> >>> nobody else has thought" Albert Szent-Györgyi
> >>
> >
>
> --
> -----------------------------------------------------------------------
> Peter Blaha,  Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-158801165300
> Email: peter.blaha at tuwien.ac.at
> WWW:    http://www.imc.tuwien.ac.at
> WIEN2k: http://www.wien2k.at
> -------------------------------------------------------------------------
>
>