[Wien] segmentation fault in lapwso
Laurence Marks
laurence.marks at gmail.com
Thu Aug 19 21:05:12 CEST 2021
I would not be so sure it is MKL. Did you check the version? Did you find a
way to reproduce it with -g and/or -C?
MKL is not perfect, but...
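
One way to check which MKL a binary actually picks up, assuming a dynamically linked lapwso on Linux, is to look at the shared libraries it resolves. This is only a sketch: the function names below are hypothetical, `ldd` and `grep` are the only tools used, and a statically linked binary will show nothing.

```shell
#!/bin/bash
# Sketch only: filter_mkl_libs and list_mkl_libs are hypothetical helper names.

# Keep only the lines of ldd-style output that mention an MKL library.
# "|| true" keeps the exit status at 0 when nothing matches.
filter_mkl_libs() {
    grep -i 'libmkl' || true
}

# Print the MKL shared libraries a dynamically linked binary resolves to.
# The resolved paths often contain the MKL installation directory, which
# can hint at the version in use.
list_mkl_libs() {
    ldd "$1" 2>/dev/null | filter_mkl_libs
}

# Example (the path is an assumption): list_mkl_libs "$WIENROOT/lapwso"
```

If the output is empty, the binary is either statically linked or not using MKL at all, and the version has to be taken from the compile and link settings instead.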
On Thu, Aug 19, 2021 at 1:14 PM Pavel Ondračka <pavel.ondracka at email.cz>
wrote:
> BTW I did the Valgrind run and there is nothing there (I don't have the
> affected MKL, but either with OpenBLAS or with the Netlib LAPACK/BLAS
> there are no Valgrind defects at all in the Wien2k code, just some
> harmless leaked memory.) So yeah, confirming this is definitely MKL.
>
> Pavel
>
> On Thu, 2021-08-19 at 06:56 -0500, Laurence Marks wrote:
> > A suggestion: check your MKL version, as there is an MKL bug that was
> > recently fixed, see
> > https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Problem-with-LAPACK-subroutine-ZHEEVR-input-array-quot-isuppz/td-p/1150816
> > _____
> > Professor Laurence Marks
> > "Research is to see what everybody else has seen, and to think what
> > nobody else has thought", Albert Szent-Györgyi
> > http://www.numis.northwestern.edu
> >
> > On Thu, Aug 19, 2021, 06:45 Peter Blaha
> > <pblaha at theochem.tuwien.ac.at> wrote:
> > > I'm still on vacation, so I cannot test this myself.
> > >
> > > However, I have experienced such problems before. It has to do with
> > > multithreading (1 thread always works fine) and the MKL routine
> > > zheevr.
> > >
> > > In my case I could fix the problem by enlarging the workspace beyond
> > > what the routine calculates itself (see the comment in hmsec on line
> > > 841).
> > >
> > > Right below, the workspace was enlarged by a factor of 10, which fixed
> > > my problem. But I can easily envision that it might not be enough in
> > > some other cases.
> > >
> > > An alternative is to switch back to zheevx (commented in the code).
> > >
> > > Peter Blaha
> > >
> > > On 18.08.2021 at 20:01, Pavel Ondračka wrote:
> > > > Right, I think it is quite clear that deallocate is failing because
> > > > the memory was corrupted at some earlier point; the only other
> > > > reason for it to crash would be that it was not allocated at all,
> > > > which seems not to be the case here... The question is what
> > > > corrupted the memory, and even stranger, why does it work if we
> > > > disable MKL multithreading?
> > > >
> > > > It could indeed be that we are doing something wrong. I can imagine
> > > > the memory being corrupted in some BLAS call: if the number of
> > > > columns/rows passed to the specific BLAS call is more than the
> > > > actual size of the matrix, then this could easily happen (and the
> > > > multithreading somehow influences the final value of the corrupted
> > > > memory, and depending on that value the deallocate could fail or
> > > > pass). This should be possible to diagnose with valgrind as
> > > > suggested.
> > > >
> > > > Luis, can you upload the testcase somewhere, or recompile with
> > > > debuginfo as suggested by Laurence earlier, run
> > > > "valgrind --track-origins=yes lapwso lapwso.def" and send the
> > > > output? Just be warned, there is a massive slowdown with valgrind
> > > > (up to 100x) and the logfile can get very large.
> > > >
> > > > Best regards
> > > > Pavel
> > > >
> > > >
> > > > On Wed, 2021-08-18 at 12:10 -0500, Laurence Marks wrote:
> > > > > > Correction, I was looking at an older modules.F. It looks like it
> > > > > > should be
> > > > >
> > > > > DEALLOCATE(vect,stat=IV) ; if(IV .ne. 0)write(*,*)IV
> > > > >
> > > > >
> > > > > On Wed, Aug 18, 2021 at 11:23 AM Laurence Marks
> > > > > <laurence.marks at gmail.com> wrote:
> > > > > > > I do wonder about this. I suggest editing modules.F and changing
> > > > > > > lines 118 and 119 to
> > > > > > > DEALLOCATE(en,stat=Ien) ; if(Ien .ne. 0)write(*,*)'Err en ',Ien
> > > > > > > DEALLOCATE(vnorm,stat=Ivn) ; if(Ivn .ne. 0)write(*,*)'Err vnorm ',Ivn
> > > > > > >
> > > > > > > There is every chance that the bug is not in those lines, but
> > > > > > > somewhere completely different. SIGSEGV often means that the code
> > > > > > > has been overwritten, for instance by arrays going out of bounds.
> > > > > > >
> > > > > > > You can also recompile with -g added (don't change the other
> > > > > > > options), and/or -C. Sometimes this is better. Or use other tools
> > > > > > > like debuggers or valgrind.
> > > > > >
> > > > > > On Wed, Aug 18, 2021 at 10:47 AM Pavel Ondračka
> > > > > > <pavel.ondracka at email.cz> wrote:
> > > > > > > > I'm CCing the list back as the crash has now been diagnosed as a
> > > > > > > > likely MKL problem, see below for more details.
> > > > > > > >
> > > > > > > > > > So just to be clear, explicitly setting OMP_STACKSIZE=1g does
> > > > > > > > > > not help to solve the issue?
> > > > > > > > >
> > > > > > > >
> > > > > > > > > Right! OMP_STACKSIZE=1g with OMP_NUM_THREADS=4 does not solve
> > > > > > > > > the problem!
> > > > > > > >
> > > > > > > > > > The problem is that the OpenMP code in lapwso is very simple,
> > > > > > > > > > so I'm having problems seeing how it could be causing the
> > > > > > > > > > problems.
> > > > > > > > > >
> > > > > > > > > > Could you also try to see what happens if you run with:
> > > > > > > > > OMP_NUM_THREADS=1
> > > > > > > > > MKL_NUM_THREADS=4
> > > > > > > > >
> > > > > > > >
> > > > > > > > > It does not work with these values, but I checked and it works
> > > > > > > > > with them swapped:
> > > > > > > > OMP_NUM_THREADS=4
> > > > > > > > MKL_NUM_THREADS=1
> > > > > > > > This was very helpful and IMO points to a problem with MKL
> > > > > > > > instead of Wien2k.
> > > > > > >
> > > > > > > > Unfortunately, setting MKL_NUM_THREADS=1 globally will reduce the
> > > > > > > > OpenMP performance, mostly in lapw1 but also in other places. So
> > > > > > > > if you want to keep the OpenMP BLAS/LAPACK-level parallelism you
> > > > > > > > have to either find some MKL version that works (if you do, please
> > > > > > > > report it here), link with OpenBLAS (using it for lapwso is
> > > > > > > > enough) or create a simple wrapper that sets MKL_NUM_THREADS=1
> > > > > > > > just for lapwso, i.e., rename the lapwso binary in WIENROOT to
> > > > > > > > lapwso_bin and create a new lapwso file there with:
> > > > > > >
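> > > > > > > > #!/bin/bash
> > > > > > > > # Run the real binary with MKL threading disabled; pass all arguments through.
> > > > > > > > MKL_NUM_THREADS=1 lapwso_bin "$@"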
> > > > > > >
> > > > > > > and set it to executable with chmod +x lapwso.
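
A slightly more defensive variant of the wrapper described above could look like the following sketch. The helper name `run_with_mkl_threads` and the `LAPWSO_BIN` variable are assumptions made for illustration and are not part of Wien2k; only `MKL_NUM_THREADS`, `WIENROOT` and the `lapwso_bin` rename come from the message.

```shell
#!/bin/bash
# Sketch only: run_with_mkl_threads and LAPWSO_BIN are hypothetical names.

# Run a command with MKL_NUM_THREADS set for that command only, so that
# OMP_NUM_THREADS and the MKL setting of every other program stay untouched.
run_with_mkl_threads() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_with_mkl_threads N command [args...]" >&2
        return 2
    fi
    local nthreads="$1"
    shift
    MKL_NUM_THREADS="$nthreads" "$@"
}

# Assumed location of the renamed binary; adjust to the local installation.
LAPWSO_BIN="${LAPWSO_BIN:-${WIENROOT:-.}/lapwso_bin}"

# In the installed wrapper script the last line would be:
#   run_with_mkl_threads 1 "$LAPWSO_BIN" "$@"
```

Using an explicit path avoids depending on lapwso_bin being found through PATH, and "$@" forwards every argument rather than only the first one.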
> > > > > > >
> > > > > > > > Or maybe MKL has a non-OpenMP version which you could link with
> > > > > > > > just lapwso and use the standard one in other parts, but I don't
> > > > > > > > know, I mostly use OpenBLAS. If you need some further help, let me
> > > > > > > > know.
> > > > > > > >
> > > > > > > > Reporting the issue to Intel would also be nice, however I never
> > > > > > > > had any real luck there and it is also a bit problematic as you
> > > > > > > > can't provide a testcase due to Wien2k being proprietary code...
> > > > > > >
> > > > > > > Best regards
> > > > > > > Pavel
> > > > > > >
> > > > > > > >
> > > > > > > > > > This should disable the Wien2k-specific OpenMP parallelism but
> > > > > > > > > > still keep the rest of the parallelism at the BLAS/LAPACK level.
> > > > > > > > >
> > > > > > > >
> > > > > > > > So, perhaps, the problem is related to MKL!
> > > > > > > >
> > > > > > > > > > Another option is that something is going wrong before lapwso
> > > > > > > > > > and the lapwso crash is just the symptom. What happens if you
> > > > > > > > > > run everything up to lapwso without OpenMP (OMP_NUM_THREADS=1)
> > > > > > > > > > and then enable it just for lapwso?
> > > > > > > > >
> > > > > > > >
> > > > > > > > > If I run lapw0 and lapw1 with OMP_NUM_THREADS=4 and then change
> > > > > > > > > it to 1 just before lapwso, it works.
> > > > > > > > > If I do the opposite, starting with OMP_NUM_THREADS=1 and then
> > > > > > > > > changing it to 4 just before lapwso, it does not work.
> > > > > > > > > So I believe that the problem is really in lapwso.
> > > > > > > >
> > > > > > > > If you need more information, please, let me know!
> > > > > > > > All the best,
> > > > > > > > Luis
> > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > Wien mailing list
> > > > > > > > Wien at zeus.theochem.tuwien.ac.at
> > > > > > > > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > > > > > > > SEARCH the MAILING-LIST at:
> > > > > > > > http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> > > > > >
> > > > > > --
> > > > > > Professor Laurence Marks
> > > > > > Department of Materials Science and Engineering
> > > > > > Northwestern University
> > > > > > http://www.numis.northwestern.edu
> > > > > > "Research is to see what everybody else has seen, and to
> > > > > > think
> > > what
> > > > > > nobody else has thought" Albert Szent-Györgyi
> > > > >
--
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
"Research is to see what everybody else has seen, and to think what nobody
else has thought" Albert Szent-Györgyi