[Wien] Lapw1 MPI run problem (gfortran+openmpi)

Deyerling, André andre.deyerling at tum.de
Fri Apr 24 11:05:04 CEST 2020


Dear WIEN2k users,


I ran into the following problem when running WIEN2k in parallel with MPI. The WIEN2k version is 19.1, and the patches provided by Gavin Abo are installed. ELPA/FFTW3/ScaLAPACK are used and compiled with gcc/gfortran (mpicc/mpif90). The compilation of WIEN2k shows no errors.


K-point parallelization works fine. WIEN2k is installed on an NFS share on a small self-built cluster (right now only 4 nodes, but there will be more once everything runs).


The problem looks like an Open MPI issue; however, simple example mpif90 programs work fine when run in parallel. Something goes wrong with lapw1para.
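
For reference, a minimal sketch of the kind of mpif90 test I mean (program and file names are only illustrative):

program mpi_test
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  ! plain init / rank query / finalize, nothing WIEN2k-specific
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  write(*,'(A,I0,A,I0)') 'Hello from rank ', rank, ' of ', nprocs
  call MPI_Finalize(ierr)
end program mpi_test

Compiled with "mpif90 mpi_test.f90 -o mpi_test" and launched with the same mpirun as in WIEN_MPIRUN (e.g. "/usr/lib64/openmpi/bin/mpirun -np 2 -machinefile .machine0 ./mpi_test"), this prints one line per rank on both nodes without any error.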


----------------------------------------------------------------------------------------------------------------------------------

run_lapw -p
STOP  LAPW0 END
[1]    Done                          /usr/lib64/openmpi/bin/mpirun -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine0 /home/mpiuser/WIEN2k-19.1/lapw0_mpi lapw0.def >> .time00
[node0:1423512:0:1423512] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
[node0:1423513:0:1423513] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /usr/lib64/libucs.so.0(+0x1b25f) [0x1462b91ad25f]
    1  /usr/lib64/libucs.so.0(+0x1b42a) [0x1462b91ad42a]
    2  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x4482df]
    3  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x40d1c5]
    4  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x42dd6e]
    5  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x404ded]
    6  /usr/lib64/libc.so.6(__libc_start_main+0xf3) [0x1462ba7bb1a3]
    7  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x404e1e]
===================
    0  /usr/lib64/libucs.so.0(+0x1b25f) [0x14b734f3725f]
    1  /usr/lib64/libucs.so.0(+0x1b42a) [0x14b734f3742a]
    2  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x4482df]
    3  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x40d1c5]
    4  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x42dd6e]
    5  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x404ded]
    6  /usr/lib64/libc.so.6(__libc_start_main+0xf3) [0x14b7365451a3]
    7  /home/mpiuser/WIEN2k-19.1/lapw1_mpi() [0x404e1e]
===================

Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 1 with PID 0 on node node0 exited on signal 11 (Segmentation fault).

[1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
--------------------------------------------------------------------------------------------------------------------------------

Dayfile of the case:


Calculating Testsession in /home/mpiuser/WIEN2k/Testsession
on node0 with PID 1423240
using WIEN2k_19.1 (Release 25/6/2019) in /home/mpiuser/WIEN2k-19.1


    start (Mon 20 Apr 2020 01:52:09 PM CEST) with lapw0 (40/99 to go)

    cycle 1 (Mon 20 Apr 2020 01:52:09 PM CEST) (40/99 to go)

>   lapw0   -p (13:52:09) starting parallel lapw0 at Mon 20 Apr 2020 01:52:09 PM CEST
-------- .machine0 : 2 processors
1.028u 0.157s 0:02.41 48.5% 0+0k 0+496io 0pf+0w
>   lapw1  -p     (13:52:11) starting parallel lapw1 at Mon 20 Apr 2020 01:52:11 PM CEST
->  starting parallel LAPW1 jobs at Mon 20 Apr 2020 01:52:11 PM CEST
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
     node0 node1(72) 0.100u 0.089s 0:01.03 17.4% 0+0k 0+8io 0pf+0w
   Summary of lapw1para:
   node0 k=0 user=72 wallclock=5.34
**  LAPW1 crashed!
0.178u 0.148s 0:02.21 14.0% 0+0k 0+136io 0pf+0w
error: command   /home/mpiuser/WIEN2k-19.1/lapw1para lapw1.def   failed

>   stop error


Parallel_Options:

setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
setenv WIEN_MPIRUN "/usr/lib64/openmpi/bin/mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"
setenv CORES_PER_NODE 1


.machines file:


1:node0:1 node1:1

lapw0:node0:1 node1:1

granularity:1


Help would be greatly appreciated.


Best Regards


André Deyerling