[Wien] Getting "Segmentation fault / execvp" error when running WIEN2k_23.2 in parallel
Brian Lee
brianhlee at utexas.edu
Mon Mar 27 03:21:27 CEST 2023
Hi, thank you for the responses.
Yes, sorry, the dayfile was from a different test run. The run using
"./wien2k_tasks_v4.sh 2 4" shows:
> lapw0 -p (12:51:21) starting parallel lapw0 at Thu Mar 23 12:51:21 CD$
-------- .machine0 : 2 processors
** lapw0 crashed!
The .machines file was generated using:
# create hostlist_wien2k from the batch allocation
mpiexec.hydra hostname | cut -d \. -f 1 | sort -n > hostlist_wien2k
# head of machines_kpoint
#
rm .machines
echo '#' > .machines
echo 'granularity:1' >> .machines
# list the hosts in rows for k-point parallelism
# lapw0 line: the first $1 hosts form the lapw0_mpi group
awk -v div=$1 '{_=int(NR/(div+1.0e-10))} {a[_]=((a[_])?a[_]FS:x)$1;l=(_>l)?_:l} END{for(i=0;i<=0;++i)print "lapw0:"a[i]":1"}' hostlist_wien2k >> .machines
# k-point lines: group the hosts into chunks of $2, one MPI group per line
awk -v div=$2 '{_=int(NR/(div+1.0e-10))} {a[_]=((a[_])?a[_]FS:x)$1;l=(_>l)?_:l} END{for(i=0;i<=l;++i)print "1:"a[i]":1"}' hostlist_wien2k >> .machines
#
# tail of machines_kpoint: allocate remaining k points one by one over all tasks
#
echo 'extrafine:1' >> .machines
# end of machines_kpoint
# cleanup
rm hostlist_wien2k
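For illustration, on a hypothetical hostlist_wien2k with eight entries (c1
repeated four times, then c2 repeated four times), calling the script as
"./wien2k_tasks_v4.sh 2 4" would produce a .machines file like:
#
granularity:1
lapw0:c1 c1:1
1:c1 c1 c1 c1:1
1:c2 c2 c2 c2:1
extrafine:1
(c1 and c2 are made-up hostnames, just to show the layout; the real list comes
from the mpiexec.hydra hostname call above.)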
I believe both fftw and WIEN2k were compiled with the same Intel
compilers, but I've attached my WIEN2k options in the second email. I've
tried different "CORES_PER_NODE" settings (16, 64), matching either the
number of cores per node I request or the total number of cores per
node, but the error is still the same. Running x lapw0 followed by x
lapw1 -p in my job script leads to:
LAPW0 END
forrtl: No such file or directory
forrtl: severe (28): CLOSE error, unit 200, file "Unknown"
Image PC Routine Line Source
lapw1_mpi 00000000004DCBAB Unknown Unknown Unknown
lapw1_mpi 00000000004CED9F Unknown Unknown Unknown
lapw1_mpi 000000000045DEE3 inilpw_ 264 inilpw.f
lapw1_mpi 0000000000462050 MAIN__ 48 lapw1_tmp_.F
lapw1_mpi 0000000000408362 Unknown Unknown Unknown
libc-2.28.so 0000147E06BC9CF3 __libc_start_main Unknown Unknown
lapw1_mpi 000000000040826E Unknown Unknown Unknown
srun: error: c306-005: task 0: Exited with exit code 28
forrtl: No such file or directory
forrtl: severe (28): CLOSE error, unit 200, file "Unknown"
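For reference, the relevant part of the job script is essentially the
following sketch (the SLURM directives and the cd target are placeholders,
not the exact script):
#!/bin/bash
#SBATCH -N 4               # placeholder: 4 nodes requested
#SBATCH -n 64              # placeholder: 64 tasks in total
cd /path/to/case           # placeholder: the WIEN2k case directory
./wien2k_tasks_v4.sh 2 4   # generate .machines as shown above
x lapw0                    # serial lapw0, writes the vsp/vns files lapw1 needs
x lapw1 -p                 # parallel lapw1, distributed according to .machines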
Any additional help or information would be greatly appreciated.
Regards,
Brian Lee | Graduate Student
The University of Texas at Austin | Texas Materials Institute
(he/him/his)
On Thu, Mar 23, 2023 at 3:51 PM Peter Blaha <peter.blaha at tuwien.ac.at>
wrote:
> My guess would be that you link with an fftw which is compiled with
> gfortran, while wien2k is compiled with ifort (or the opposite, or with
> different compiler versions.....).
>
> Or it was compiled with proper compilers, but the mpi was mixed (openmpi
> vs. intelmpi, ...).
>
>
> You can also try to run only
>
> x lapw0 (serial, so that you get proper vsp and vns files for lapw1)
>
> x lapw1 -p in mpi-mode. lapw1 does not link fftw (but scalapack and
> hopefully elpa).
>
>
> Otherwise your report cannot be fully correct:
>
> You claim that you requested 2 cores for lapw0, and part of your email
> supports this.
>
> However, I do not understand why the dayfile claims to have 4 cores in
> .machine0 ???
>
> About the way wien2k launches mpi jobs: You can "see" how it does it in
> the error logs:
>
> srun -K -N1 -n2 -r0 /home1/08844/leebrian/wien2k/lapw0_mpi lapw0.def >>
> .time00
>
> Your sysadmins can check this command and you can put this line in your
> submit script and test it.
>
> PS: In any case, you request 4 nodes and in total 64 cores.
>
> But with this .machines file you use only 2 cores in lapw0 and 16 in
> lapw1/2. This wastes your cpu-hours.
>
> Check the part of your script (wien2k_tasks... ????) that generates the
> .machines file.
>
> PS: What is your CORES_PER_NODE setting ?
>
> PPS: The message from L. Marks that you need a ":number" in the .machines
> file is not true. It is perfectly ok and the same to use node:1 or
> only node.
>
>
> On 23.03.2023 at 19:14, Brian Lee wrote:
>
> Hello WIEN2k users/developers,
>
> I am a graduate student at UT Austin in the MS&E program and would like to
> test WIEN2k_23.2 using various parallelization schemes. When I try to run
> "run_lapw -p" with the default MPI run command suggested during siteconfig
> along with a .machines file/job script that requests 2 processors per lapw0
> and/or 2 processors per kpt, I receive the following error:
>
> --
> -----------------------------------------------------------------------
> Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-158801165300
> Email: peter.blaha at tuwien.ac.at
> WWW: http://www.imc.tuwien.ac.at WIEN2k: http://www.wien2k.at
> -------------------------------------------------------------------------
>