[Wien] lapw0 stuck/drained with seemingly no error message upon launching SCF

Gavin Abo gabo13279 at gmail.com
Mon Jun 24 08:06:55 CEST 2024


Just to check, has your gfortran-built WIEN2k 23.2 lapw0 already been 
patched with the fix at:

https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg23272.html

With gfortran on Linux the patch was needed, whereas Intel ifort seemed 
to work fine without it.  However, I don't know about gfortran on macOS, 
as I don't have a macOS system.

Unfortunately, only the disk images (.dmg) for the older Lion 10.7 to 
Sierra 10.12 releases can be downloaded with a web browser from [1].  
With a disk image, it might have been possible to run your particular 
macOS version in a virtual machine on Windows or Linux.  The macOS 
version you are using is probably one of the newer ones, which appear to 
be downloadable only from the App Store on a compatible Mac.  That means 
a .dmg is not available to those of us without a Mac who would like to 
help troubleshoot the issue.

[1] https://support.apple.com/en-us/102662

Kind Regards,
Gavin
WIEN2k user

On 6/23/2024 10:24 PM, Yichen Zhang wrote:
> Dear Laurence and Peter,
>
> 1) No, I did not run with omp. Everything discussed in this thread so far was run in sequential mode (no -p). However, I have also tested dstart and lapw0 in parallel mode: lapw0 hangs just as it does in serial mode, while dstart runs fine in parallel mode. Just in case, I attach below one version of my .machines file, from when I ran dstart in sequential mode but lapw0 in parallel mode with 2 processors:
> ***********
> #dstart:localhost localhost
> speed:localhost localhost
> lapw0:localhost localhost
>
> 1:localhost
> 1:localhost
> granularity:1
> extrafine:1
>
> omp_global:16
> ***********
> And of course, I never made it to lapw1, due to the lapw0 hanging issue.
>
> 2) By inserting a series of PRINT *, “BREAKPOINT1,2,3,…” statements, I have determined the exact line where the programme hangs. In the output of “time lapw0 lapw0.def”, it hangs exactly at CALL XCPOT1(luse2,LM,…). The context in lapw0.F is:
> ***********
> if (.not.xcpot1qq) then
>    PRINT *, "BREAKPOINT13"
>    CALL XCPOT1(luse2,LM,…)
>    PRINT *, "BREAKPOINT14"
> ***********
> BREAKPOINT13 is the last one printed; 14 is not printed. Importantly, none of the BREAKPOINTs inside the subroutine XCPOT1 is printed. The first “BREAKPOINT” in XCPOT1 is at the earliest legitimate position, right after all the USE, IMPLICIT NONE, and parameter declarations, and it doesn’t get printed. That seems to indicate that XCPOT1 is called but never starts running, so the code hangs after “BREAKPOINT13” and never prints the BREAKPOINTs in XCPOT1 or BREAKPOINT14.
> I don’t understand why, considering XCPOT1 subroutine seems legit and compiled fine...
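
One thing that might be worth ruling out here: depending on the compiler and on whether the output goes to a terminal or to a file, text written with PRINT * can sit in a buffer, so the last marker that shows up is not necessarily the last one that was executed. Below is a minimal sketch of markers that are flushed immediately. The module name debug_trace, the subroutine checkpoint, and the stand-in routine work are made up for illustration and are not part of WIEN2k.

```fortran
! Hypothetical helper, not part of WIEN2k: write a marker to stderr and
! flush it at once, so buffering cannot hide how far the run progressed.
module debug_trace
  use, intrinsic :: iso_fortran_env, only: error_unit
  implicit none
contains
  subroutine checkpoint(label)
    character(len=*), intent(in) :: label
    write(error_unit, '(2a)') 'CHECKPOINT ', label
    flush(error_unit)          ! Fortran 2003 FLUSH statement
  end subroutine checkpoint
end module debug_trace

program demo
  use debug_trace, only: checkpoint
  implicit none
  call checkpoint('before call')
  call work()                  ! stands in for the call under investigation
  call checkpoint('after call')
contains
  subroutine work()
    ! Marker placed as the first executable statement of the routine.
    call checkpoint('inside work, first executable statement')
  end subroutine work
end program demo
```

If the markers still stop in the same place with flushing in effect, then buffering is not the explanation. With gfortran, setting the environment variable GFORTRAN_UNBUFFERED_ALL=1 before the run should have a similar effect without editing the source.
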
>
> 3) My last resort was to ask ChatGPT why subroutines can hang; it suggested 7 possibilities, from the programming level to the system level. Below are my guesses and questions on each of them.
>   a) Infinite loops. I have checked all DO loops in XCPOT1.f, and all loops are closed. If any were not, the compiler should have found that. So NO.
>   b) Large memory allocation. There is no large array allocation in XCPOT1, despite three dynamic allocations. So NOT likely.
>   c) Recursion without proper termination. NO. XCPOT1 is not a recursive subroutine.
>   d) Blocking I/O operations. NO. It was not waiting for user input or reading from a slow device.
>   e) Incorrect use of pointers. NO. I didn’t find pointers in XCPOT1.
>   f) Stack overflow. No. Again, I didn’t see any recursion or large arrays. The three dynamic allocatables seem small.
>   g) Deadlocks. This is the one I don’t quite understand, but my guess is no. Even though I run lapw0 in sequential mode, could a circular dependency between tasks still occur when the programme runs on an Apple silicon Mac?
>
> This is where the problem is stuck at the moment, unfortunately.
>
>
> Best regards
> Yichen

