[Wien] Getting "Segmentation fault / execvp" error when running WIEN2k_23.2 in parallel

Brian Lee brianhlee at utexas.edu
Mon Mar 27 03:21:27 CEST 2023


Hi, thank you for the responses.

Yes, sorry the dayfile was from a different test run. The run using
"./wien2k_tasks_v4.sh 2 4" shows it as:


>   lapw0   -p          (12:51:21) starting parallel lapw0 at Thu Mar 23 12:51:21 CDT

-------- .machine0 : 2 processors

**  lapw0 crashed!

The .machines file was generated using:

# create hostlist_wien2k from the batch host list

mpiexec.hydra hostname|cut -d \. -f 1 | sort -n > hostlist_wien2k

# head of machines_kpoint

#

rm .machines

echo '#' > .machines

echo 'granularity:1' >> .machines

# list the hosts in rows for k-point parallelism

awk -v div=$1 '{_=int(NR/(div+1.0e-10))} {a[_]=((a[_])?a[_]FS:x)$1;l=(_>l)?_:l} END{for(i=0;i<=0;++i)print "lapw0:"a[i]":1"}' hostlist_wien2k >> .machines

awk -v div=$2 '{_=int(NR/(div+1.0e-10))} {a[_]=((a[_])?a[_]FS:x)$1;l=(_>l)?_:l} END{for(i=0;i<=l;++i)print "1:"a[i]":1"}' hostlist_wien2k >> .machines

#

# tail of machines_kpoint: allocate remaining k points one by one over all tasks

#

echo 'extrafine:1' >>.machines

# end of machines_kpoint

# cleanup

rm hostlist_wien2k
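
For reference, with "./wien2k_tasks_v4.sh 2 4" and a hypothetical hostlist_wien2k containing four entries each for two nodes (host names below are only illustrative; the actual job has 4 nodes x 16 tasks), the script above should produce a .machines file of this form:

#
granularity:1
lapw0:c306-005 c306-005:1
1:c306-005 c306-005 c306-005 c306-005:1
1:c306-006 c306-006 c306-006 c306-006:1
extrafine:1

i.e. 2 MPI processes for lapw0 and one k-point group of 4 MPI processes per node.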

I believe both fftw and WIEN2k were compiled with the same Intel
compilers, but I've attached my WIEN2k options in the second email. I've
tried different "CORES_PER_NODE" settings (16, 64), matching either the
number of cores per node I request or the total number of cores per node,
but the error is the same, and running x lapw0 followed by x lapw1 -p in
my job script leads to:


 LAPW0 END

forrtl: No such file or directory

forrtl: severe (28): CLOSE error, unit 200, file "Unknown"

Image              PC                Routine            Line        Source

lapw1_mpi          00000000004DCBAB  Unknown               Unknown  Unknown

lapw1_mpi          00000000004CED9F  Unknown               Unknown  Unknown

lapw1_mpi          000000000045DEE3  inilpw_                   264  inilpw.f

lapw1_mpi          0000000000462050  MAIN__                     48  lapw1_tmp_.F

lapw1_mpi          0000000000408362  Unknown               Unknown  Unknown

libc-2.28.so       0000147E06BC9CF3  __libc_start_main     Unknown  Unknown

lapw1_mpi          000000000040826E  Unknown               Unknown  Unknown

srun: error: c306-005: task 0: Exited with exit code 28

forrtl: No such file or directory

forrtl: severe (28): CLOSE error, unit 200, file "Unknown"
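
For completeness, a quick sanity check (file names follow the usual WIEN2k case.* convention, so this is just a sketch) of whether the serial lapw0 actually produced the potential files lapw1 needs would be:

# run in the case directory after the serial "x lapw0"
ls -l *.vsp *.vns      # potential files lapw0 must have written for lapw1
cat lapw0.error        # should be empty if lapw0 finished cleanly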

Any additional help/information would be greatly appreciated.

Regards,

Brian Lee  |  Graduate Student

The University of Texas at Austin | Texas Materials Institute

(he/him/his)

On Thu, Mar 23, 2023 at 3:51 PM Peter Blaha <peter.blaha at tuwien.ac.at>
wrote:

> My guess would be that you link with an fftw which was compiled with
> gfortran, while wien2k is compiled with ifort (or the opposite, or with
> different compiler versions ...).
>
> Or it was compiled with the proper compilers, but the mpi was mixed (openmpi
> vs. intelmpi, ...).
>
>
> You can also try to run only
>
> x lapw0     (serial, so that you get proper vsp and vns files for lapw1)
>
> x lapw1 -p    in mpi-mode. lapw1 does not link fftw (but scalapack and
> hopefully elpa).
>
>
> Otherwise your report cannot be fully correct:
>
>  You claim that you requested 2 cores for lapw0, and part of your email
> supports this.
>
> However, I do not understand why the dayfile claims to have 4 cores in
> .machine0 ???
>
> About the way wien2k launches mpi jobs: You can "see"  how it does it in
> the error logs:
>
> srun -K -N1 -n2 -r0 /home1/08844/leebrian/wien2k/lapw0_mpi lapw0.def >>
> .time00
>
> Your sysadmins can check this command and you can put this line in your
> submit script and test it.
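>
> As a minimal sketch (partition/account options and the case path are
> placeholders for your system), such a test job could look like:
>
> #!/bin/bash
> #SBATCH -N 1                 # one node is enough for this test
> #SBATCH -n 2                 # two MPI tasks, matching -n2 in the srun line
> #SBATCH -t 00:10:00
> # add the partition/account options your cluster requires
>
> cd /path/to/case             # the WIEN2k case directory containing lapw0.def
> srun -K -N1 -n2 -r0 /home1/08844/leebrian/wien2k/lapw0_mpi lapw0.def
>
> (leave off the ">> .time00" redirect so any error message ends up in the job
> output)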
>
> PS: In any case, you request 4 nodes and in total 64 cores.
>
> But with this .machines file you use only 2 cores in lapw0 and 16 in
> lapw1/2. This wastes your cpu-hours.
>
> Check the part of your script (wien2k_tasks... ????) that generates the
> .machines file.
>
> PS: What is your CORES_PER_NODE setting?
>
> PPS: The message from L. Marks that you need a ":number" in the .machines
> file is not true. It is perfectly ok and equivalent to use   node:1   or
> only   node.
>
>
> On 23.03.2023 at 19:14, Brian Lee wrote:
>
> Hello WIEN2k users/developers,
>
> I am a graduate student at UT Austin in the MS&E program and would like to
> test WIEN2k_23.2 using various parallelization schemes. When I try to run
> “run_lapw -p” with the default MPI run command suggested during siteconfig
> along with a .machines file/job script that requests 2 processors per lapw0
> and/or 2 processors per kpt, I receive the following error:
>
> --
> -----------------------------------------------------------------------
> Peter Blaha,  Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-158801165300
> Email: peter.blaha at tuwien.ac.at
> WWW:   http://www.imc.tuwien.ac.at      WIEN2k: http://www.wien2k.at
> -------------------------------------------------------------------------
>

