[Wien] lapw0 stuck/drained with seemingly no error message upon launching SCF

Yichen Zhang zycforphysics at gmail.com
Mon Jun 24 06:24:15 CEST 2024


Dear Laurence and Peter,

1) No, I did not run with omp. All the runs discussed earlier in this thread were in sequential mode (no -p). However, I have also tested dstart and lapw0 in parallel mode: lapw0 hangs just as it does in serial mode, while dstart runs fine in parallel. Just in case, here is one version of my .machines file, used when I ran dstart sequentially and lapw0 in parallel with 2 processors:
***********
#dstart:localhost localhost
speed:localhost localhost
lapw0:localhost localhost

1:localhost
1:localhost
granularity:1
extrafine:1

omp_global:16
***********
And of course, I never made it to lapw1, due to the lapw0 hanging issue.

2) By inserting a series of PRINT *, "BREAKPOINT1,2,3,…" statements, I have determined the exact line where the programme hangs. In the output of "time lapw0 lapw0.def", it hangs exactly at CALL XCPOT1(luse2,LM,…). The context in lapw0.F is:
***********
if (.not.xcpot1qq) then
  PRINT *, "BREAKPOINT13"
  CALL XCPOT1(luse2,LM,…)
  PRINT *, "BREAKPOINT14"
***********
BREAKPOINT13 is the last one printed; BREAKPOINT14 is not printed. Importantly, none of the BREAKPOINT statements inside the subroutine XCPOT1 is printed either. The first one in XCPOT1 sits at the earliest legal position, after all the USE statements, IMPLICIT NONE, and the parameter declarations, and it does not get printed. That seems to indicate that XCPOT1 is called but never starts executing, so the code hangs after "BREAKPOINT13" and never prints the BREAKPOINTs in XCPOT1 or BREAKPOINT14.
I don’t understand why, considering that the XCPOT1 subroutine looks legitimate and compiles fine...
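
One caveat with this kind of print tracing: when output is redirected to a file or a pipe, some Fortran runtimes buffer standard output, so the last marker in the log may lag behind the last statement actually executed. A minimal sketch for pulling the last marker out of a captured log, assuming a hypothetical log file name (lapw0.log) and the BREAKPOINT naming used above:

```shell
#!/bin/sh
# Print the last BREAKPOINT marker found in a captured log file.
last_breakpoint() {
  grep -o 'BREAKPOINT[0-9]*' "$1" | tail -n 1
}

# Demo on a made-up log. A real one would come from something like
#   lapw0 lapw0.def > lapw0.log 2>&1
# where lapw0.log is a hypothetical name chosen for illustration.
log="$(mktemp)"
printf ' BREAKPOINT12\n BREAKPOINT13\n' > "$log"
last_breakpoint "$log"   # prints BREAKPOINT13
rm -f "$log"
```

If the trace still stops at BREAKPOINT13 when the output goes straight to the terminal, buffering becomes a less likely explanation for the missing markers.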

3) As a last resort, I asked ChatGPT why a subroutine can hang; it suggested 7 possibilities, from the programming level to the system level. Below are my guesses and questions on each of them.
 a) Infinite loops. I have checked all DO loops in XCPOT1.f, and all of them are closed. If any were not, the compiler should have caught it. So NO.
 b) Large memory allocation. XCPOT1 has three dynamic allocations, but none of them is a large array. So NOT likely.
 c) Recursion without proper termination. NO. XCPOT1 is not a recursive subroutine.
 d) Blocking I/O operations. NO. It is not waiting for user input or reading from a slow device.
 e) Incorrect use of pointers. NO. I didn’t find any pointers in XCPOT1.
 f) Stack overflow. NO. Again, I didn’t see any recursion or large arrays. The three allocatable arrays seem small.
 g) Deadlocks. This is the one where I don’t quite understand whether it could happen, but my guess is no. Even though I run lapw0 in sequential mode, could a circular dependency between tasks still occur when the programme runs on an Apple silicon Mac?
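
For possibility f), one inexpensive cross-check is the shell's stack limit, since a Fortran compiler may place automatic arrays and temporaries on the stack even when the source shows no obviously large arrays. A minimal sketch for bash or zsh; the function name and the 65532 kB default threshold are assumptions chosen for illustration, not WIEN2k requirements:

```shell
# Report the soft stack limit of the current shell and flag it when it is
# below a chosen threshold (in kB). The threshold is a hypothetical value.
check_stack_limit() {
  threshold_kb="${1:-65532}"
  current="$(ulimit -s)"
  if [ "$current" = "unlimited" ]; then
    echo "stack limit: unlimited"
  elif [ "$current" -lt "$threshold_kb" ]; then
    echo "stack limit: ${current} kB (below ${threshold_kb} kB)"
  else
    echo "stack limit: ${current} kB"
  fi
}

check_stack_limit
```

Running this in the same shell session that launches lapw0 shows the limit the programme would inherit.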

This is where I am stuck at the moment, unfortunately.


Best regards
Yichen

