[Wien] Getting "Segmentation fault / execvp" error when running WIEN2k_23.2 in parallel

Brian Lee brianhlee at utexas.edu
Thu Mar 23 19:14:28 CET 2023


Hello WIEN2k users/developers,

I am a graduate student at UT Austin in the MS&E program and would like to
test WIEN2k_23.2 using various parallelization schemes. When I try to run
“run_lapw -p” with the default MPI run command suggested during siteconfig
along with a .machines file/job script that requests 2 processors per lapw0
and/or 2 processors per kpt, I receive the following error:

[c302-005:191113:0:191113] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x14a9ed892798)
[c302-005:191112:0:191112] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x145a05e75798)
==== backtrace (tid: 191113) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x000000000033c798 ???()  /scratch/tacc/apps/intel19/impi19_0/fftw3/3.3.10/lib/libfftw3.so.3:0
=================================
w2k_dispatch_signal(): received: Segmentation fault
Abort(324) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 324) - process 0
==== backtrace (tid: 191112) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x000000000033c798 ???()  /scratch/tacc/apps/intel19/impi19_0/fftw3/3.3.10/lib/libfftw3.so.3:0
=================================
w2k_dispatch_signal(): received: Segmentation fault
Abort(324) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 324) - process 0
srun: error: c302-005: tasks 0-1: Exited with exit code 68
srun: launch/slurm: _step_signal: Terminating StepId=755176.1
[1]    Exit 68                       srun -K -N1 -n2 -r0 /home1/08844/leebrian/wien2k/lapw0_mpi lapw0.def >> .time00
cat: No match.

And the .dayfile shows that the SCF calculation crashes on the first cycle:

>   lapw0   -p          (17:01:09) starting parallel lapw0 at Thu Mar 16 17:01:09 CDT 2023
-------- .machine0 : 4 processors
**  lapw0 crashed!
4.061u 2.731s 0:04.38 155.0%    0+0k 16739+784io 82pf+0w
error: command   /home1/08844/leebrian/wien2k/lapw0para lapw0.def   failed

I've tried following Prof. Marks' recommendations in this post
(http://zeus.theochem.tuwien.ac.at/pipermail/wien/2016-February/024357.html),
since my system also uses "ibrun" to launch MPI jobs, but that leads to an
‘execvp’ error:

[proxy:0:0 at c305-005.ls6.tacc.utexas.edu] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:145): execvp error on file .machine0 (Permission denied)
[proxy:0:0 at c305-005.ls6.tacc.utexas.edu] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:145): execvp error on file .machine0 (Permission denied)
[1]    Exit 255                      ibrun -n 2 -o 0 -machinefile .machine0 /home1/08844/leebrian/wien2k/lapw0_mpi lapw0.def >> .time00

cat: No match.
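
One possible reading of the ‘execvp’ message is that the launcher treated
.machine0 as the program to execute rather than as a host list, which could
happen if the "-machinefile" option is not recognized and its argument is
passed through as the command. This is an interpretation of the log, not
something confirmed for ibrun. The sketch below reproduces only the generic
failure: handing a non-executable text file to the shell as a command ends
in "Permission denied" (exit status 126 in POSIX shells). The function name
and the temporary file are hypothetical.

```shell
#!/bin/sh
# Hypothetical reproduction: try to execute a plain host list as a program.
# Prints the exit status of the attempt (126 means "found but not executable").
try_exec_hostfile() {
    tmpdir=$(mktemp -d) || return 1
    hostfile="$tmpdir/machine0"

    # Same content as the .machine0 file shown in this post.
    printf 'c305-005\nc305-005\n' > "$hostfile"
    chmod 644 "$hostfile"          # readable, but no execute bit

    # Attempt to run the file; capture the status without aborting under set -e.
    status=0
    "$hostfile" 2>/dev/null || status=$?

    rm -rf "$tmpdir"
    echo "$status"
}
```

Calling try_exec_hostfile prints the status of the failed exec. If that
matches what the launcher reports, the error says something about what was
handed to exec, not about lapw0_mpi itself.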

For reference, my generated .machines file is included below. I am
running my calculations on the “lonestar6” cluster at TACC
(https://portal.tacc.utexas.edu/user-guides/lonestar6). I am able to run
test cases (ranging from 3 atom unit cells to 50+ atom cells) using a
.machines file that pins one processor per lapw0 and per kpoint. My system
admin believes that the issue lies with how WIEN2k launches MPI jobs
(particularly with how the MPI run command is set up and how WIEN2k uses
the “CORES_PER_NODE” setting), so I am hoping that someone in the WIEN2k
community might have some insight that could help me resolve this issue.
Thank you for your input and time.

JobScript:

#!/bin/sh
#
#SBATCH -J test-tacc           # Job name
#SBATCH --output=out.%j
#SBATCH --error=err.%j
#SBATCH -p development          # Queue (partition) name
#SBATCH -N 4               # Total # of nodes
#SBATCH -n 64              # Total # of mpi tasks
#SBATCH -t 00:10:00        # Run time (hh:mm:ss)
#SBATCH --mail-type=all    # Send email at begin and end of job
#SBATCH --mail-user=brianhlee at utexas.edu

export WIENROOT='/home1/08844/leebrian/wien2k'
echo "$WIENROOT"
cd $SLURM_SUBMIT_DIR
./wien2k_tasks_v4.sh 2 4 #processors to lapw0 & processors per kpt
$WIENROOT/run_lapw -p

.machines:

#
granularity:1
lapw0:c305-005 c305-005:1
1:c305-005 c305-005:1
1:c305-005 c305-005:1
1:c307-005 c307-005:1
1:c307-005 c307-005:1
1:c308-005 c308-005:1
1:c308-005 c308-005:1
1:c308-006 c308-006:1
1:c308-006 c308-006:1
extrafine:1

.machine0:

c305-005

c305-005
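
As a sanity check on the files above, the number of MPI ranks requested for
lapw0 can be counted directly from .machines and compared with the number of
lines in .machine0. The sketch below assumes each entry on the "lapw0:" line
is either "host" or "host:N", with a missing N counted as one rank.
count_lapw0_ranks is a hypothetical helper, not part of WIEN2k.

```shell
#!/bin/sh
# Count the MPI ranks requested on the "lapw0:" line of a .machines file.
# Assumption: entries are "host" or "host:N"; a bare host counts as 1 rank.
count_lapw0_ranks() {
    awk '
        /^lapw0:/ {
            sub(/^lapw0:/, "")              # drop the keyword, fields are re-split
            total = 0
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, ":")   # parts[1] = host, parts[2] = count (if any)
                total += (n > 1) ? parts[2] : 1
            }
            print total
            exit                            # only the first lapw0: line is used
        }
    ' "$1"
}
```

For the .machines file shown above, "count_lapw0_ranks .machines" prints 2,
which matches the two hosts listed in .machine0.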


Regards,

Brian Lee  |  Graduate Student

The University of Texas at Austin | Texas Materials Institute

(he/him/his)