[Wien] Getting "Segmentation fault / execvp" error when running WIEN2k_23.2 in parallel

Brian Lee brianhlee at utexas.edu
Mon Mar 27 03:56:43 CEST 2023


Here are some of my compiler options for WIEN2k:

FOPT:-mkl -O -FR -mp1 -w -prec_div -pc80 -pad -ip -g -DINTEL_VML -DMKL_LP64
-traceback -assume buffered_io -I$(TACC_MKL_INC)

FPOPT:-mkl -O -FR -mp1 -w -prec_div -pc80 -pad -ip -g -DINTEL_VML
-DMKL_LP64 -traceback -assume buffered_io -I$(TACC_MKL_INC)

LDFLAGS:$(FOPT)
-Wl,-rpath,/scratch/tacc/apps/intel19/impi19_0/fftw3/3.3.10/lib,-rpath,/opt/intel/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64,-rpath,/opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64,-rpath,/usr/lib64
-L/usr/lib64 -lm -ldl -lpthread
-L/opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64
-liomp5

R_LIBS:-L/opt/intel/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core

RP_LIBS:$(R_LIBS)

FFTWROOT:/scratch/tacc/apps/intel19/impi19_0/fftw3/3.3.10/

FFTW_LIBNAME:fftw3

SCALAPACKROOT:/opt/intel/compilers_and_libraries_2020.1.217/linux/mkl/lib/

SCALAPACK_LIBNAME:mkl_scalapack_lp64

BLACSROOT:/opt/intel/compilers_and_libraries_2020.1.217/linux/mkl/lib/

BLACS_LIBNAME:mkl_blacs_intelmpi_lp64

MPIRUN:srun -N _nodes_ -n _NP_ _PINNING_ _EXEC_

CORES_PER_NODE:64
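
(For context: the WIEN2k parallel scripts substitute _nodes_, _NP_, _PINNING_ and
_EXEC_ in this MPIRUN template at run time, so a 2-task lapw0 launch should end up
roughly as the srun line quoted from the error logs further down, i.e. something
like

srun -K -N1 -n2 -r0 $WIENROOT/lapw0_mpi lapw0.def >> .time00

This is only an illustrative sketch: the exact -K/-r flags depend on what _PINNING_
expands to in the siteconfig settings, and $WIENROOT stands for the WIEN2k
installation directory.)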

On Sun, Mar 26, 2023 at 8:21 PM Brian Lee <brianhlee at utexas.edu> wrote:

> Hi, thank you for the responses
>
> Yes, sorry, the dayfile was from a different test run. The run using
> "./wien2k_tasks_v4.sh 2 4" shows it as:
>
>
> >   lapw0   -p          (12:51:21) starting parallel lapw0 at Thu Mar 23 12:51:21 CD$
>
> -------- .machine0 : 2 processors
>
> **  lapw0 crashed!
>
> .machines file was generated using:
>
> # create hostfile_tacc from a batch
>
> mpiexec.hydra hostname|cut -d \. -f 1 | sort -n > hostlist_wien2k
>
> # head of machines_kpoint
>
> #
>
> rm .machines
>
> echo '#' > .machines
>
> echo 'granularity:1' >> .machines
>
> # list the hosts in rows for k-point parallelism
>
> awk -v div=$1 '{_=int(NR/(div+1.0e-10))} {a[_]=((a[_])?a[_]FS:x)$1;l=(_>l)?_:l}END{for(i=0;i<=0;++i)print "lapw0:"a[i]":1"}' hostlist_wien2k >>.machines
>
> awk -v div=$2 '{_=int(NR/(div+1.0e-10))} {a[_]=((a[_])?a[_]FS:x)$1;l=(_>l)?_:l}END{for(i=0;i<=l;++i)print "1:"a[i]":1"}' hostlist_wien2k >>.machines
>
> #
>
> # tail of machines_kpoint: allocate remaining k points one by one over all tasks
>
> #
>
> echo 'extrafine:1' >>.machines
>
> # end of machines_kpoint
>
> # cleanup
>
> rm hostlist_wien2k
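>
> For illustration only (a sketch, not taken from an actual run): if
> hostlist_wien2k contained eight entries h1 ... h8 (one line per MPI task, so
> repeated host names are expected), the commands above with the arguments
> "2 4" would generate roughly
>
> #
> granularity:1
> lapw0:h1 h2:1
> 1:h1 h2 h3 h4:1
> 1:h5 h6 h7 h8:1
> extrafine:1
>
> where h1 ... h8 are placeholder host names; the two entries on the lapw0 line
> match the "-------- .machine0 : 2 processors" line in the dayfile above.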
>
> I believe both fftw and WIEN2k were compiled with the same Intel
> compilers, but I've attached my WIEN2k options in the second email. I’ve
> tried using different “CORES_PER_NODE” settings (16, 64) to either match
> the number of cores per node I request or the number of total cores per
> node, but the error is still the same, and running x lapw0 followed by x
> lapw1 -p in my job script leads to:
>
>
>  LAPW0 END
>
> forrtl: No such file or directory
>
> forrtl: severe (28): CLOSE error, unit 200, file "Unknown"
>
> Image              PC                Routine            Line        Source
>
> lapw1_mpi          00000000004DCBAB  Unknown               Unknown  Unknown
>
> lapw1_mpi          00000000004CED9F  Unknown               Unknown  Unknown
>
> lapw1_mpi          000000000045DEE3  inilpw_                   264  inilpw.f
>
> lapw1_mpi          0000000000462050  MAIN__                     48  lapw1_tmp_.F
>
> lapw1_mpi          0000000000408362  Unknown               Unknown  Unknown
>
> libc-2.28.so       0000147E06BC9CF3  __libc_start_main     Unknown  Unknown
>
> lapw1_mpi          000000000040826E  Unknown               Unknown  Unknown
>
> srun: error: c306-005: task 0: Exited with exit code 28
>
> forrtl: No such file or directory
>
> forrtl: severe (28): CLOSE error, unit 200, file "Unknown"
>
> Any additional help/information would be greatly appreciated
>
> Regards,
>
> Brian Lee  |  Graduate Student
>
> The University of Texas at Austin | Texas Materials Institute
>
> (he/him/his)
>
> On Thu, Mar 23, 2023 at 3:51 PM Peter Blaha <peter.blaha at tuwien.ac.at>
> wrote:
>
>> My guess would be that you link with an fftw which was compiled with
>> gfortran, while wien2k is compiled with ifort (or the opposite, or with
>> different compiler versions.....).
>>
>> Or it was compiled with the proper compilers, but the MPI was mixed
>> (openmpi vs intelmpi, ...).
>>
>>
>> You can also try to run only
>>
>> x lapw0     (serial, so that you get proper vsp and vns files for lapw1)
>>
>> x lapw1 -p    in mpi-mode. lapw1 does not link fftw (but scalapack and
>> hopefully elpa).
>>
>>
>> Otherwise your report cannot be fully correct:
>>
>> You claim that you requested 2 cores for lapw0, and part of your email
>> supports this.
>>
>> However, I do not understand why the dayfile claims to have 4 cores in
>> .machine0 ???
>>
>> About the way wien2k launches mpi jobs: You can "see"  how it does it in
>> the error logs:
>>
>> srun -K -N1 -n2 -r0 /home1/08844/leebrian/wien2k/lapw0_mpi lapw0.def >>
>> .time00
>>
>> Your sysadmins can check this command and you can put this line in your
>> submit script and test it.
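>>
>> As a minimal sketch (the SBATCH directives, time limit and case directory
>> below are placeholders, not taken from your actual job), such a test job
>> could look like:
>>
>> #!/bin/bash
>> #SBATCH -N 1
>> #SBATCH -n 2
>> #SBATCH -t 00:10:00
>> cd /path/to/your/case    # case directory containing lapw0.def, case.struct, ...
>> srun -K -N1 -n2 -r0 /home1/08844/leebrian/wien2k/lapw0_mpi lapw0.def
>>
>> If this minimal job already fails with the same execvp/segfault, the problem
>> is presumably in the srun/MPI environment rather than in the WIEN2k scripts.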
>>
>> PS: In any case, you request 4 nodes and in total 64 cores.
>>
>> But with this .machines file you use only 2 cores in lapw0 and 16 in
>> lapw1/2. This wastes your cpu-hours.
>>
>> Check the part of your script (wien2k_tasks... ????) that generates the
>> .machines file.
>>
>> PS: What is your CORES_PER_NODE setting ?
>>
>> PPS: The message from L. Marks that you need a ":number" in the .machines
>> file is not true. It is perfectly ok, and equivalent, to use "node:1" or
>> only "node".
>>
>>
>> Am 23.03.2023 um 19:14 schrieb Brian Lee:
>>
>> Hello WIEN2k users/developers,
>>
>> I am a graduate student at UT Austin in the MS&E program and would like
>> to test WIEN2k_23.2 using various parallelization schemes. When I try to run
>> “run_lapw -p” with the default MPI run command suggested during siteconfig
>> along with a .machines file/job script that requests 2 processors per lapw0
>> and/or 2 processors per kpt, I receive the following error:
>>
>> --
>> -----------------------------------------------------------------------
>> Peter Blaha,  Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
>> Phone: +43-158801165300
>> Email: peter.blaha at tuwien.ac.at
>> WWW:   http://www.imc.tuwien.ac.at      WIEN2k: http://www.wien2k.at
>> -------------------------------------------------------------------------
>>
>>
>