<div dir="ltr"><div>Dear Wien2k Community,</div><div> I have just sent the "problematic" directory to Prof. Pavel Ondračka.</div><div> It is too large to be sent through the list, but if you are interested, please let me know and I will send you a copy.<br></div><div> Feel free to distribute the directory to anyone and test it in any way.</div><div> All the best,</div><div> Luis</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 18, 2021 at 15:01, Pavel Ondračka <<a href="mailto:pavel.ondracka@email.cz">pavel.ondracka@email.cz</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Right, I think the reason deallocate is failing is quite clear: the<br>
memory has been corrupted at some earlier point. The only other reason<br>
it could crash would be that the memory was never allocated at all,<br>
which seems not to be the case here... The question is what corrupted<br>
the memory, and, even more strangely, why it works if we disable MKL<br>
multithreading.<br>
<br>
It could indeed be that we are doing something wrong. I can imagine the<br>
memory being corrupted in some BLAS call: if the number of columns/rows<br>
passed to a specific BLAS call is larger than the actual size of the<br>
matrix, then this could easily happen (and the multithreading somehow<br>
influences the final value of the corrupted memory, and depending on<br>
that final value the deallocate either fails or passes). This should be<br>
possible to diagnose with valgrind as suggested.<br>
<br>
Luis, can you upload the testcase somewhere, or recompile with<br>
debuginfo as suggested by Laurence earlier, run<br>
"valgrind --track-origins=yes lapwso lapwso.def", and send the output?<br>
Just be warned, valgrind causes a massive slowdown (up to 100x) and the<br>
logfile can get very large.<br>
<br>
Best regards<br>
Pavel<br>
<br>
<br>
On Wed, 2021-08-18 at 12:10 -0500, Laurence Marks wrote:<br>
> Correction, I was looking at an older modules.F. It looks like it<br>
> should be<br>
> <br>
> DEALLOCATE(vect,stat=IV) ; if(IV .ne. 0)write(*,*)IV<br>
> <br>
> <br>
> On Wed, Aug 18, 2021 at 11:23 AM Laurence Marks<br>
> <<a href="mailto:laurence.marks@gmail.com" target="_blank">laurence.marks@gmail.com</a>> wrote:<br>
> > I do wonder about this. I suggest editing module.F and changing<br>
> > lines 118 and 119 to<br>
> > DEALLOCATE(en,stat=Ien) ; if(Ien .ne. 0)write(*,*)'Err en ',Ien<br>
> > DEALLOCATE(vnorm,stat=Ivn) ; if(Ivn .ne. 0)write(*,*)'Err vnorm ',Ivn<br>
> > <br>
> > There is every chance that the bug is not in those lines, but<br>
> > somewhere completely different. SIGSEGV often means that memory<br>
> > has been overwritten, for instance by arrays going out of bounds.<br>
> > <br>
> > You can also recompile with -g added (don't change other options),<br>
> > and/or with -C for runtime checks. Sometimes this is better. Or use<br>
> > other tools such as debuggers or valgrind.<br>
> > <br>
> > On Wed, Aug 18, 2021 at 10:47 AM Pavel Ondračka<br>
> > <<a href="mailto:pavel.ondracka@email.cz" target="_blank">pavel.ondracka@email.cz</a>> wrote:<br>
> > > I'm CCing the list back, as the crash has now been diagnosed as a<br>
> > > likely MKL problem; see below for more details.<br>
> > > > <br>
> > > > <br>
> > > > > So just to be clear, explicitly setting OMP_STACKSIZE=1g does<br>
> > > > > not help to solve the issue?<br>
> > > > > <br>
> > > > <br>
> > > > <br>
> > > > Right! OMP_STACKSIZE=1g with OMP_NUM_THREADS=4 does not solve<br>
> > > > the problem!<br>
> > > > <br>
> > > > > <br>
> > > > > The problem is that the OpenMP code in lapwso is very simple,<br>
> > > > > so I'm having trouble seeing how it could be causing the<br>
> > > > > problems.<br>
> > > > > <br>
> > > > > Could you also try to see what happens if you run with:<br>
> > > > > OMP_NUM_THREADS=1<br>
> > > > > MKL_NUM_THREADS=4<br>
> > > > > <br>
> > > > <br>
> > > > <br>
> > > > It does not work with these values, but I checked and it works<br>
> > > > with them reversed:<br>
> > > > OMP_NUM_THREADS=4<br>
> > > > MKL_NUM_THREADS=1<br>
> > > <br>
> > > This was very helpful and IMO points to a problem with MKL<br>
> > > instead of Wien2k.<br>
> > > <br>
> > > Unfortunately, setting MKL_NUM_THREADS=1 globally will reduce the<br>
> > > OpenMP performance, mostly in lapw1 but also at other places. So<br>
> > > if you want to keep the OpenMP BLAS/lapack level parallelism, you<br>
> > > have to either find some MKL version that works (if you do, please<br>
> > > report it here), link with OpenBLAS (using it for lapwso is<br>
> > > enough), or create a simple wrapper that sets MKL_NUM_THREADS=1<br>
> > > just for lapwso, i.e., rename the lapwso binary in WIENROOT to<br>
> > > lapwso_bin and create a new lapwso file there with:<br>
> > > <br>
> > > #!/bin/bash<br>
> > > MKL_NUM_THREADS=1 lapwso_bin "$1"<br>
> > > <br>
> > > and make it executable with chmod +x lapwso.<br>
> > > <br>
> > > Or maybe MKL has a non-OpenMP version which you could link with<br>
> > > just lapwso while using the standard one in the other parts, but<br>
> > > I don't know, I mostly use OpenBLAS. If you need some further<br>
> > > help, let me know.<br>
> > > <br>
> > > Reporting the issue to Intel could also be nice; however, I never<br>
> > > had any real luck there, and it is also a bit problematic as you<br>
> > > can't provide a testcase due to Wien2k being proprietary code...<br>
> > > <br>
> > > Best regards<br>
> > > Pavel<br>
> > > <br>
> > > > <br>
> > > > > <br>
> > > > > This should disable the Wien2k-specific OpenMP parallelism<br>
> > > > > but still keep the rest of the parallelism at the BLAS/lapack<br>
> > > > > level.<br>
> > > > > <br>
> > > > <br>
> > > > <br>
> > > > So, perhaps, the problem is related to MKL!<br>
> > > > <br>
> > > > > <br>
> > > > > Another option is that something is going wrong before lapwso<br>
> > > > > and the lapwso crash is just the symptom. What happens if you<br>
> > > > > run everything up to lapwso without OpenMP<br>
> > > > > (OMP_NUM_THREADS=1) and then enable it just for lapwso?<br>
> > > > > <br>
> > > > <br>
> > > > <br>
> > > > If I run lapw0 and lapw1 with OMP_NUM_THREADS=4 and then change<br>
> > > > it to 1 just before lapwso, it works.<br>
> > > > If I do the opposite, starting with OMP_NUM_THREADS=1 and then<br>
> > > > changing it to 4 just before lapwso, it does not work.<br>
> > > > So I believe that the problem is really in lapwso.<br>
> > > > <br>
> > > > If you need more information, please, let me know!<br>
> > > > All the best,<br>
> > > > Luis<br>
> > > <br>
> > > <br>
> > > _______________________________________________<br>
> > > Wien mailing list<br>
> > > <a href="mailto:Wien@zeus.theochem.tuwien.ac.at" target="_blank">Wien@zeus.theochem.tuwien.ac.at</a><br>
> > > <a href="http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien" rel="noreferrer" target="_blank">http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien</a><br>
> > > <br>
> > > SEARCH the MAILING-LIST at: <br>
> > > <a href="http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html" rel="noreferrer" target="_blank">http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html</a><br>
> > > <br>
> > <br>
> > <br>
> > -- <br>
> > Professor Laurence Marks<br>
> > Department of Materials Science and Engineering<br>
> > Northwestern University<br>
> > <a href="http://www.numis.northwestern.edu" rel="noreferrer" target="_blank">www.numis.northwestern.edu</a><br>
> > "Research is to see what everybody else has seen, and to think what<br>
> > nobody else has thought" Albert Szent-Györgyi<br>
> <br>
> <br>
<br>
<br>
</blockquote></div>