[Wien] New findings on the lapw0 seg fault core dump error
Michael Fechtelkord
Michael.Fechtelkord at ruhr-uni-bochum.de
Mon Jun 9 11:10:45 CEST 2025
Hello Gerhard,
thanks for your detailed remarks. My impression is that the OMP
directive "do reduction" can be a problem if the reduction loops are not
programmed cleanly (so I gather from reports found via Google; I am no
Fortran programmer). If you google "!$OMP do reduction SIGSEGV ifx" you
get several hits where it happens in different source codes with ifx,
but also with ifort (which must have been around 2020, judging from the
comment by jdoumont; at that time ifx did not yet exist):
Some examples:
https://community.intel.com/t5/Intel-Fortran-Compiler/Seg-fault-with-OpenMP-loop-in-IFX-2024-0-2-Linux/td-p/1572916
https://stackoverflow.com/questions/8583720/ifort-mpi-openmp-segmentation-fault
Interestingly, lapw0 only crashes in the first cycle, no matter
which structure you use; after the first mixing you can set 8
threads and use parallelisation without crashes. So it depends on the input.
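For illustration, here is a minimal sketch of the kind of construct involved (the variable names are taken from the gdb backtrace quoted further down in this thread; the loop body is invented and is not the actual lapw0 code):

   program reduction_sketch
     implicit none
     integer :: i
     real*8 :: rhopw00
     complex*16 :: cwk(1000)
     rhopw00 = 0.d0
     cwk = (0.d0, 0.d0)
     !$omp parallel do reduction(+:rhopw00,cwk)
     do i = 1, 1000
        rhopw00 = rhopw00 + 1.d0/dble(i)          ! scalar reduction
        cwk(i)  = cwk(i) + dcmplx(dble(i), 0.d0)  ! whole-array reduction
     end do
     !$omp end parallel do
     print *, rhopw00, sum(cwk)
   end program reduction_sketch

For an array in a reduction clause, every thread allocates a private copy of the whole array, usually on the thread stack; for large arrays this can overflow the stack, which would fit the observation that the crash depends on the input.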
Best regards,
Michael
On 08.06.2025 at 20:04, Fecher, Gerhard wrote:
> Hello Michael,
>
> From the comments (*** LDM CHANGES ***), it seems that Laurence added or changed the OMP part; I guess he may know what is potentially going wrong.
>
> The code following line 1356 is roughly similar to that at 1644 ff. and carries the same comment by jdoumont, but no segmentation fault appears there.
>
> The segmentation fault appears with ifx (from 2025.2) but not with ifort (latest version 2025.2), as far as I remember.
>
> Note:
> If you check the files bad.txt, fix.txt, and good.txt from your GitHub link, you see that they just removed the bad line in good.txt by commenting it out for testing, that's all.
> In fix.txt they removed the $omp directives, which is the same thing I already did at the beginning of the year (see my post "ifx OMP Problem in lapw0 ...." from 2nd January).
> However, this just works around the problem, but does not fix it!
>
> I guess good.txt is just meant to locate the erroneous file and is not a suggestion to remove this line!!
>
> Ciao
> Gerhard
>
> DEEP THOUGHT in D. Adams; Hitchhikers Guide to the Galaxy:
> "I think the problem, to be quite honest with you,
> is that you have never actually known what the question is."
>
> ====================================
> Dr. Gerhard H. Fecher
> Institute of Physics
> Johannes Gutenberg - University
> 55099 Mainz
> ________________________________________
> From: Wien [wien-bounces at zeus.theochem.tuwien.ac.at] on behalf of Michael Fechtelkord via Wien [wien at zeus.theochem.tuwien.ac.at]
> Sent: Sunday, 8 June 2025 13:13
> To: A Mailing list for WIEN2k users
> Cc: Michael Fechtelkord
> Subject: Re: [Wien] New findings on the lapw0 seg fault core dump error
>
> Hello Gerhard and Peter,
>
>
> I am using ifx 2025.1.1, and I have also read that OpenMP reductions cause a
> segfault with Intel compilers. They recommend serializing the loops or
> removing the line that performs the reduction to eliminate the segfault.
>
> https://github.com/flang-compiler/flang/issues/56
>
>
> I have answered Peter's questions below, inserted between his comments.
>
> So can I comment out the reduction procedure (is it not needed)?
> I already serialized the first cycle by setting omp_lapw0:1.
> After the first cycle lapw0 runs smoothly even with 8 OMP threads.
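>
> For reference, the relevant .machines lines for such a first cycle are then simply
>
>    omp_global:2
>    omp_lapw0:1
>
> and after the first cycle I set omp_lapw0 back to 8.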
>
>
> Best regards,
>
> Michael
>
>
> On 08.06.2025 at 10:27, Fecher, Gerhard wrote:
>> Dear Peter and Michael,
>> I receive the segmentation fault with OneAPI 2024.2 and OneAPI 2025.1
>> it appears already with -O1
>>
>> I already mentioned it some time ago: when I comment out the $omp directives at lines 1649 ff., the program runs smoothly.
>>
>> It seems that this is an old, unresolved problem, as it is mentioned in a comment by jdoumont from 30/7/20
>> (however, it does not seem to depend on the size of the calculation).
>>
>> Ciao
>> Gerhard
>>
>> DEEP THOUGHT in D. Adams; Hitchhikers Guide to the Galaxy:
>> "I think the problem, to be quite honest with you,
>> is that you have never actually known what the question is."
>>
>> ====================================
>> Dr. Gerhard H. Fecher
>> Institute of Physics
>> Johannes Gutenberg - University
>> 55099 Mainz
>> ________________________________________
>> From: Wien [wien-bounces at zeus.theochem.tuwien.ac.at] on behalf of Peter Blaha [peter.blaha at tuwien.ac.at]
>> Sent: Saturday, 7 June 2025 20:40
>> To: wien at zeus.theochem.tuwien.ac.at
>> Subject: Re: [Wien] New findings on the lapw0 seg fault core dump error
>>
>> Very curious.
>>
>> Is "number of PW" in case.clmsum after init_lapw and after the
>> first cycle identical ?
> The number of PW is 2239 in the starting case.clmsum as well as in the
> case.clmsum after the first cycle.
>> Since this is a small case: can you manually look at the
>> Fourier coefficients in clmsum? Any "huge" numbers? Any *** numbers?
> No big numbers, no ****
>> After dstart, I guess none of the FK are zero. After mixer (after 1st
>> iteration) the later ones should be zero.
>>
>> My guess is a problem in the libthread library of your compiler version
>> (ifx 2025.xxx?). The problems did not show up with previous compilers?
> I am using ifx 2025.1.1
>>
>> On 07.06.2025 at 18:18, Michael Fechtelkord via Wien wrote:
>>> Smiles .. no, it is MgF2 .. just two atoms in a cubic cell, and it is not
>>> dependent on the structure. It crashes for all of them in the first cycle,
>>> using the clmsum from init_lapw.
>>>
>>> On 07.06.2025 at 17:34, Peter Blaha wrote:
>>>> Is this a big supercell ?
>>>>
>>>> The only thing I could imagine is that the number of PWs is bigger
>>>> after dstart than after the 1st cycle.
>>>> grep for "PW" in the clmsum files from dstart and after the 1st cycle.
>>>> As a temporary fix, possibly reduce the number of PWs until it works.
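>>>> For example (case.clmsum_dstart stands for a copy saved right after
>>>> dstart; the name is only an example):
>>>>
>>>>     grep PW case.clmsum_dstart
>>>>     grep PW case.clmsum
>>>>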
>>>> It might be a "stack" problem; I think one can increase this
>>>> somehow, but I can't remember how.
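>>>> (Presumably via the shell stack limit and/or the OpenMP thread stack,
>>>> e.g.
>>>>
>>>>     ulimit -s unlimited
>>>>     export OMP_STACKSIZE=512m
>>>>
>>>> where 512m is only an illustrative value.)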
>>>>
>>>> On 06.06.2025 at 22:25, Michael Fechtelkord via Wien wrote:
>>>>> And an additional comment:
>>>>>
>>>>>
>>>>> lapw0 crashes only in the first cycle with OMP_NUM_THREADS higher
>>>>> than 1. When I set lapw0:1 for the first cycle (using -i 1 in
>>>>> run_lapw) and then, after the first run, set it back to lapw0:8, it runs
>>>>> without a problem for the complete SCF cycle. It seems to be a
>>>>> problem with the initial case.clmsum file (init_lapw -b -prec 1).
>>>>>
>>>>>
>>>>> On 06.06.2025 at 22:07, Michael Fechtelkord via Wien wrote:
>>>>>> Hello Peter,
>>>>>>
>>>>>>
>>>>>> omp_lapw0 in .machines was 8. I reduced it from 8 to 4, then to 2,
>>>>>> and finally to 1. Only with omp_lapw0:1 does lapw0 not crash.
>>>>>>
>>>>>> omp_global:2
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> On 06.06.2025 at 17:59, Peter Blaha wrote:
>>>>>>> What was your OMP_NUM_THREADS variable?
>>>>>>>
>>>>>>> Set it to 1, 2, ... and check if the error occurs again.
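>>>>>>>
>>>>>>> For example, in the shell before re-running the step:
>>>>>>>
>>>>>>>     export OMP_NUM_THREADS=1
>>>>>>>     x lapw0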
>>>>>>>
>>>>>>> On 06.06.2025 at 14:07, Michael Fechtelkord via Wien wrote:
>>>>>>>> I debugged the core dump file with gdb, using debugging symbols
>>>>>>>> when compiling lapw0.
>>>>>>>>
>>>>>>>> The debugger gave me the line which causes the core dump:
>>>>>>>>
>>>>>>>> ----------------------------------------
>>>>>>>>
>>>>>>>> Debuginfod has been enabled.
>>>>>>>> To make this setting permanent, add 'set debuginfod enabled on'
>>>>>>>> to .gdbinit.
>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>>>>>>> Core was generated by `/usr/local/WIEN2k/lapw0 lapw0.def'.
>>>>>>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>>>>>>>
>>>>>>>> #0 0x000000000048b89b in
>>>>>>>> MAIN__.DIR.OMP.PARALLEL.LOOP.12.split63842.split63939 () at
>>>>>>>> lapw0.F:1649
>>>>>>>>
>>>>>>>> 1649 !$omp parallel do reduction(+:rhopw00,cwk,cvout) &
>>>>>>>>
>>>>>>>>
>>>>>>>> [Current thread is 1 (Thread 0x14823edbe740 (LWP 339344))]
>>>>>>>>
>>>>>>>> ------------------------------------
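>>>>>>>>
>>>>>>>> (For reference, roughly how I obtained this: compile lapw0 with
>>>>>>>> debugging symbols, e.g. with -g added to the compiler options, let it
>>>>>>>> crash again, and then
>>>>>>>>
>>>>>>>>     gdb /usr/local/WIEN2k/lapw0 core
>>>>>>>>     (gdb) bt
>>>>>>>>
>>>>>>>> where the core file name depends on the system's core_pattern setting.)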
>>>>>>>>
>>>>>>>> Maybe somebody has an idea how to fix it..
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>> On 17.05.2025 at 13:48, Michael Fechtelkord via Wien wrote:
>>>>>>>>> Hello everybody,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have new results concerning the lapw0 crash which happens
>>>>>>>>> intermittently (segmentation fault error - core dump).
>>>>>>>>>
>>>>>>>>> It seems that the crucial thing is the case.clmsum file (I am no
>>>>>>>>> expert here), but somehow this file seems to be the key: lapw0
>>>>>>>>> reads it, so it might be what sometimes triggers the crash.
>>>>>>>>>
>>>>>>>>> I calculated MgF2 and substituted the newly generated clmsum with an
>>>>>>>>> older one, and then there was no crash. I cannot attach the files
>>>>>>>>> because they are too large.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not enough into debugging to find out why and where it happens.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Dr. Michael Fechtelkord
>>>>>>>>
>>>>>>>> Institut für Geologie, Mineralogie und Geophysik
>>>>>>>> Ruhr-Universität Bochum
>>>>>>>> Universitätsstr. 150
>>>>>>>> D-44780 Bochum
>>>>>>>>
>>>>>>>> Phone: +49 (234) 32-24380
>>>>>>>> Fax: +49 (234) 32-04380
>>>>>>>> Email: Michael.Fechtelkord at ruhr-uni-bochum.de
>>>>>>>> Web Page: https://www.ruhr-uni-bochum.de/kristallographie/kc/mitarbeiter/fechtelkord/
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Wien mailing list
>>>>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>>>>>> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>> --
>> -----------------------------------------------------------------------
>> Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
>> Phone: +43-158801165300
>> Email: peter.blaha at tuwien.ac.at
>> WWW: http://www.imc.tuwien.ac.at WIEN2k: http://www.wien2k.at
>> -------------------------------------------------------------------------
>>
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> --
> Dr. Michael Fechtelkord
>
> Institut für Geologie, Mineralogie und Geophysik
> Ruhr-Universität Bochum
> Universitätsstr. 150
> D-44780 Bochum
>
> Phone: +49 (234) 32-24380
> Fax: +49 (234) 32-04380
> Email: Michael.Fechtelkord at ruhr-uni-bochum.de
> Web Page: https://www.ruhr-uni-bochum.de/kristallographie/kc/mitarbeiter/fechtelkord/
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
--
Dr. Michael Fechtelkord
Institut für Geologie, Mineralogie und Geophysik
Ruhr-Universität Bochum
Universitätsstr. 150
D-44780 Bochum
Phone: +49 (234) 32-24380
Fax: +49 (234) 32-04380
Email: Michael.Fechtelkord at ruhr-uni-bochum.de
Web Page: https://www.ruhr-uni-bochum.de/kristallographie/kc/mitarbeiter/fechtelkord/