[Wien] segmentation fault in lapwso

Pavel Ondračka pavel.ondracka at email.cz
Thu Aug 19 20:13:49 CEST 2021


BTW I did the Valgrind run and there is nothing there (I don't have the
affected MKL, but either with OpenBLAS or with the Netlib LAPACK/BLAS
there are no Valgrind defects at all in the Wien2k code, just some
harmless leaked memory.) So yeah, confirming this is definitely MKL.
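
For anyone who wants to repeat the thread-count experiments quoted below
on their own MKL build, here is a minimal sketch. It runs a given command
under each combination of OMP_NUM_THREADS and MKL_NUM_THREADS (1 and 4,
the values tried in this thread) and prints which combinations exit
cleanly. The function name try_thread_matrix is made up for illustration
and is not part of Wien2k, and the sketch assumes that a crash shows up
as a non-zero exit status.

```shell
# Sketch only: try_thread_matrix is a hypothetical helper, not part of Wien2k.
# It runs the given command once per OMP/MKL thread-count combination and
# reports whether the command exited with status 0.
try_thread_matrix() {
    local omp mkl status
    for omp in 1 4; do
        for mkl in 1 4; do
            # The two variables are set only for this single invocation.
            if OMP_NUM_THREADS=$omp MKL_NUM_THREADS=$mkl "$@" >/dev/null 2>&1; then
                status=ok
            else
                status=FAILED
            fi
            echo "OMP_NUM_THREADS=$omp MKL_NUM_THREADS=$mkl $status"
        done
    done
}

# Example (assumes lapwso and lapwso.def are available in the case directory):
# try_thread_matrix lapwso lapwso.def
```

If the problem is the one discussed here, the lines with MKL_NUM_THREADS=4
would be expected to report FAILED while those with MKL_NUM_THREADS=1
report ok, but that is an expectation from this thread, not a guarantee.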

Pavel

On Thu, 2021-08-19 at 06:56 -0500, Laurence Marks wrote:
> A suggestion: check your MKL version, as there is an MKL bug that was
> recently fixed, see
> https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Problem-with-LAPACK-subroutine-ZHEEVR-input-array-quot-isuppz/td-p/1150816
> _____
> Professor Laurence Marks
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought", Albert Szent-Györgyi
> www.numis.northwestern.edu
> 
> On Thu, Aug 19, 2021, 06:45 Peter Blaha
> <pblaha at theochem.tuwien.ac.at> wrote:
> > I'm still on vacation, so I cannot test this myself.
> > 
> > However, I experienced such problems before. It has to do with
> > multithreading (1 thread always works fine) and the MKL routine
> > zheevr.
> > 
> > In my case I could fix the problem by enlarging the workspace beyond
> > what the routine calculates itself (see the comment in hmsec on line
> > 841).
> > 
> > Right below that comment, the workspace was enlarged by a factor of 10,
> > which fixed my problem. But I can easily envision that it might not be
> > enough in some other cases.
> > 
> > An alternative is to switch back to zheevx (commented in the code).
> > 
> > Peter Blaha
> > 
> > Am 18.08.2021 um 20:01 schrieb Pavel Ondračka:
> > > Right, I think it is quite clear that deallocate is failing because the
> > > memory was corrupted at some earlier point; the only other reason for
> > > it to crash would be that the array was not allocated at all, which
> > > seems not to be the case here... The question is what corrupted the
> > > memory and, even stranger, why does it work if we disable MKL
> > > multithreading?
> > > 
> > > It could indeed be that we are doing something wrong. I can imagine the
> > > memory could be corrupted in some BLAS call: if the number of
> > > columns/rows passed to the specific BLAS call is more than the actual
> > > size of the matrix, then this could easily happen (and the
> > > multithreading is somehow influencing the final value of the corrupted
> > > memory, and depending on that value the deallocate could fail or
> > > pass). This should be possible to diagnose with valgrind as suggested.
> > > 
> > > Luis, can you upload the testcase somewhere, or recompile with
> > > debuginfo as suggested by Laurence earlier, run
> > > "valgrind --track-origins=yes lapwso lapwso.def" and send the output?
> > > Just be warned, there is a massive slowdown with valgrind (up to 100x)
> > > and the logfile can get very large.
> > > 
> > > Best regards
> > > Pavel
> > > 
> > > 
> > > On Wed, 2021-08-18 at 12:10 -0500, Laurence Marks wrote:
> > > > Correction, I was looking at an older modules.F. It looks like it
> > > > should be
> > > > 
> > > > DEALLOCATE(vect,stat=IV) ; if(IV .ne. 0)write(*,*)IV
> > > > 
> > > > 
> > > > On Wed, Aug 18, 2021 at 11:23 AM Laurence Marks
> > > > <laurence.marks at gmail.com> wrote:
> > > > > I do wonder about this. I suggest editing modules.F and changing
> > > > > lines 118 and 119 to
> > > > >        DEALLOCATE(en,stat=Ien) ; if(Ien .ne. 0)write(*,*)'Err en ',Ien
> > > > >        DEALLOCATE(vnorm,stat=Ivn) ; if(Ivn .ne. 0)write(*,*)'Err vnorm ',Ivn
> > > > > 
> > > > > There is every chance that the bug is not in those lines, but
> > > > > somewhere completely different. SIGSEGV often means that memory
> > > > > has been overwritten, for instance by arrays going out of bounds.
> > > > > 
> > > > > You can also recompile with -g added (don't change other options),
> > > > > and/or -C. Sometimes this is better. Or use other things like
> > > > > debuggers or valgrind.
> > > > > 
> > > > > On Wed, Aug 18, 2021 at 10:47 AM Pavel Ondračka
> > > > > <pavel.ondracka at email.cz> wrote:
> > > > > > I'm CCing the list back as the crash has now been traced to a
> > > > > > likely MKL problem, see below for more details.
> > > > > > > 
> > > > > > > > So just to be clear, explicitly setting OMP_STACKSIZE=1g does
> > > > > > > > not help to solve the issue?
> > > > > > > > 
> > > > > > > 
> > > > > > > Right! OMP_STACKSIZE=1g with OMP_NUM_THREADS=4 does not solve
> > > > > > > the problem!
> > > > > > >    
> > > > > > > > The problem is that the OpenMP code in lapwso is very simple,
> > > > > > > > so I'm having problems seeing how it could be causing the
> > > > > > > > problems.
> > > > > > > > 
> > > > > > > > Could you also try to see what happens if you run with:
> > > > > > > > OMP_NUM_THREADS=1
> > > > > > > > MKL_NUM_THREADS=4
> > > > > > > > 
> > > > > > > 
> > > > > > > It does not work with these values, but I checked and it works
> > > > > > > with them swapped:
> > > > > > > OMP_NUM_THREADS=4
> > > > > > > MKL_NUM_THREADS=1
> > > > > > This was very helpful and IMO points to a problem with MKL
> > > > > > instead of Wien2k.
> > > > > > 
> > > > > > Unfortunately, setting MKL_NUM_THREADS=1 globally will reduce the
> > > > > > OpenMP performance, mostly in lapw1 but also in other places. So
> > > > > > if you want to keep the OpenMP BLAS/LAPACK-level parallelism you
> > > > > > have to either find some MKL version that works (if you do,
> > > > > > please report it here), link with OpenBLAS (using it for lapwso
> > > > > > is enough) or create a simple wrapper that sets
> > > > > > MKL_NUM_THREADS=1 just for lapwso, i.e., rename the lapwso binary
> > > > > > in WIENROOT to lapwso_bin and create a new lapwso file there
> > > > > > with:
> > > > > > 
> > > > > > #!/bin/bash
> > > > > > MKL_NUM_THREADS=1 lapwso_bin "$@"
> > > > > > 
> > > > > > and make it executable with chmod +x lapwso.
> > > > > > 
> > > > > > Or maybe MKL has a non-OpenMP version which you could link with
> > > > > > just lapwso and use the standard one in other parts, but dunno, I
> > > > > > mostly use OpenBLAS. If you need some further help, let me know.
> > > > > > 
> > > > > > Reporting the issue to Intel would also be nice, however I never
> > > > > > had any real luck there and it is also a bit problematic as you
> > > > > > can't provide a testcase due to Wien2k being proprietary code...
> > > > > > 
> > > > > > Best regards
> > > > > > Pavel
> > > > > > 
> > > > > > >    
> > > > > > > > This should disable the Wien2k-specific OpenMP parallelism
> > > > > > > > but still keep the rest of the parallelism at the BLAS/LAPACK
> > > > > > > > level.
> > > > > > > > 
> > > > > > > 
> > > > > > > So, perhaps, the problem is related to MKL!
> > > > > > >    
> > > > > > > > Another option is that something is going wrong before lapwso
> > > > > > > > and the lapwso crash is just the symptom. What happens if you
> > > > > > > > run everything up to lapwso without OpenMP
> > > > > > > > (OMP_NUM_THREADS=1) and then enable it just for lapwso?
> > > > > > > > 
> > > > > > > 
> > > > > > > If I run lapw0 and lapw1 with OMP_NUM_THREADS=4 and then change
> > > > > > > it to 1 just before lapwso, it works.
> > > > > > > If I do the opposite, starting with OMP_NUM_THREADS=1 and then
> > > > > > > change it to 4 just before lapwso, it does not work.
> > > > > > > So I believe that the problem is really in lapwso.
> > > > > > >    
> > > > > > >      If you need more information, please, let me know!
> > > > > > >      All the best,
> > > > > > >               Luis
> > > > > > 
> > > > > > _______________________________________________
> > > > > > Wien mailing list
> > > > > > Wien at zeus.theochem.tuwien.ac.at
> > > > > > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > > > > > SEARCH the MAILING-LIST at:
> > > > > > http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Professor Laurence Marks
> > > > > Department of Materials Science and Engineering
> > > > > Northwestern University
> > > > > http://www.numis.northwestern.edu
> > > > > "Research is to see what everybody else has seen, and to think what
> > > > > nobody else has thought" Albert Szent-Györgyi
> > > > 
> > > 
> > 



