[Wien] MPI error
leila mollabashi
le.mollabashi at gmail.com
Sun May 2 21:18:53 CEST 2021
Dear Prof. Peter Blaha and WIEN2k users,
Now I have loaded openmpi/4.1.0 and compiled WIEN2k. The admin told me
that I can use your script from http://www.wien2k.at/reg_user/faq/slurm.job.
I added these lines to it as well:
module load openmpi/4.1.0_gcc620
module load ifort
module load mkl
but I got the error "bash: mpirun: command not found".
In an interactive session, "x lapw0 -p" and "x lapw2 -p" execute in MPI, but "x
lapw1 -p" stops with the following error:
w2k_dispatch_signal(): received: Segmentation fault
--------------------------------------------------------------------------
I noticed that the FFTW3 and Open MPI libraries installed on the cluster are
both compiled with gfortran, but I compiled WIEN2k with Intel ifort. I am not
sure whether the problem originates from this inconsistency between gfortran
and ifort.
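One way to check for such a mismatch (a sketch; $WIENROOT and the binary name follow the standard WIEN2k layout, adjust to your install) is to inspect which MPI and Fortran runtime libraries the parallel binaries actually resolve at load time:

```shell
# List the shared libraries lapw1_mpi will load. A libmpi coming from a
# gfortran-built Open MPI combined with an ifort-built lapw1_mpi can break
# the Fortran MPI interface even though compilation and linking succeeded.
ldd $WIENROOT/lapw1_mpi | grep -Ei 'mpi|ifcore|gfortran'
```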
I have also checked that lapw1 itself compiled correctly.
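For reference, x lapw1 -p distributes the k-parallel jobs according to the .machines file, so in a batch job that file has to be rebuilt from the nodes SLURM actually allocated. A minimal sketch of such a generator (a hypothetical helper, not the official slurm.job contents; node names are illustrative):

```shell
# Hypothetical helper: write a .machines file assigning one k-parallel
# lapw1 job per entry in the node list. In a real job the list would come
# from:  scontrol show hostnames "$SLURM_JOB_NODELIST"
make_machines() {
  rm -f .machines
  for node in $1; do
    echo "1:$node" >> .machines   # weight 1, one k-point job on $node
  done
  echo "granularity:1" >> .machines
  echo "extrafine:1" >> .machines
}

make_machines "e0467 e0467 e0468 e0468"
cat .machines
```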
Sincerely yours,
Leila
On Fri, Apr 23, 2021 at 7:26 PM Peter Blaha <pblaha at theochem.tuwien.ac.at>
wrote:
> Recompile with LI, since mpirun is supported (after loading the proper
> mpi).
>
> PS: Ask them if -np and -machinefile is still possible to use. Otherwise
> you cannot mix k-parallel and mpi parallel and for sure, for smaller
> cases it is a severe limitation to have only ONE mpi job with many
> k-points, small matrix size and many mpi cores.
>
> Am 23.04.2021 um 16:04 schrieb leila mollabashi:
> > Dear Prof. Peter Blaha and WIEN2k users,
> >
> > Thank you for your assistances.
> >
> > Here it is the admin reply:
> >
> > * mpirun/mpiexec command is supported after loading the proper module (I
> >   suggest openmpi/4.1.0 with gcc 6.2.0 or icc)
> > * you have to describe the needed resources (I suggest --nodes and
> >   --ntasks-per-node; please use a whole node, so ntasks-per-node =
> >   28, 32 or 48, depending on the partition)
> > * Yes, our cluster has "tight integration with mpi", but the
> >   other way around: our MPI libraries are compiled with SLURM
> >   support, so when you describe resources at the beginning of the batch
> >   script, you do not have to use the "-np" and "-machinefile" options for
> >   mpirun/mpiexec
> >
> > * the error message "btl_openib_component.c:1699:init_one_device" is
> >   caused by an "old" MPI library, so please recompile your application
> >   (WIEN2k) using openmpi/4.1.0_icc19
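Put together, the admin's points above would give a batch script skeleton roughly like the following (a sketch under the assumptions above; the partition name, the 28-core count, and the application path are placeholders):

```shell
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=28   # "whole node": 28, 32 or 48 depending on partition
#SBATCH --time=20:00:00
#SBATCH -p standard            # illustrative partition name

# Load the MPI stack inside the job script, so mpirun/mpiexec is on PATH
# for the batch shell (not only in the interactive login shell).
module load openmpi/4.1.0_gcc620

# With SLURM-integrated Open MPI, the resource description above suffices:
# no -np and no -machinefile are needed.
mpiexec ./application
```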
> >
> > Now should I compile WIEN2k with SL or LI?
> >
> > Sincerely yours,
> >
> > Leila Mollabashi
> >
> >
> > On Wed, Apr 14, 2021 at 10:34 AM Peter Blaha
> > <pblaha at theochem.tuwien.ac.at> wrote:
> >
> > It cannot initialize an mpi job, because it is missing the interface
> > software.
> >
> > You need to ask the computing center / system administrators how one
> > executes an mpi job on this computer.
> >
> > It could be that "mpirun" is not supported on this machine. You may try
> > a wien2k installation with system "LS" in siteconfig. This will
> > configure the parallel environment/commands using "slurm" commands like
> > srun -K -N _nodes_ -n _NP_ ..., replacing mpirun.
> > We used it once on our hpc machine, since it was recommended by the
> > computing center people. However, it turned out that the standard mpirun
> > installation was more stable, because the "slurm controller" died too
> > often, leading to many random crashes. Anyway, if your system has what is
> > called "tight integration of mpi", it might be necessary.
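As an illustration, the srun-based launch configured this way would replace the usual mpirun call roughly like this (node and process counts are illustrative, not taken from the thread):

```shell
# mpirun-style launch that WIEN2k normally issues for an mpi-parallel step:
#   mpirun -np 8 -machinefile .machine0 $WIENROOT/lapw0_mpi lapw0.def
# slurm-based equivalent, letting srun place the 8 processes on 1 node
# (-K kills the job step if any task exits abnormally):
srun -K -N 1 -n 8 $WIENROOT/lapw0_mpi lapw0.def
```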
> >
> > Am 13.04.2021 um 21:47 schrieb leila mollabashi:
> > > Dear Prof. Peter Blaha and WIEN2k users,
> > >
> > > Then by running x lapw1 -p:
> > >
> > > starting parallel lapw1 at Tue Apr 13 21:04:15 CEST 2021
> > >
> > > -> starting parallel LAPW1 jobs at Tue Apr 13 21:04:15 CEST 2021
> > >
> > > running LAPW1 in parallel mode (using .machines)
> > >
> > > 2 number_of_parallel_jobs
> > >
> > > [1] 14530
> > >
> > > [e0467:14538] mca_base_component_repository_open: unable to open
> > > mca_btl_uct: libucp.so.0: cannot open shared object file: No such
> > file
> > > or directory (ignored)
> > >
> > > WARNING: There was an error initializing an OpenFabrics device.
> > >
> > > Local host: e0467
> > >
> > > Local device: mlx4_0
> > >
> > > MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> > >
> > > with errorcode 0.
> > >
> > > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI
> processes.
> > >
> > > You may or may not see output from other processes, depending on
> > >
> > > exactly when Open MPI kills them.
> > >
> > >
> >
> --------------------------------------------------------------------------
> > >
> > > [e0467:14567] 1 more process has sent help message
> > > help-mpi-btl-openib.txt / error in device init
> > >
> > > [e0467:14567] 1 more process has sent help message
> > > help-mpi-btl-openib.txt / error in device init
> > >
> > > [e0467:14567] Set MCA parameter "orte_base_help_aggregate" to 0
> > to see
> > > all help / error messages
> > >
> > > [warn] Epoll MOD(1) on fd 27 failed. Old events were 6; read
> > change was
> > > 0 (none); write change was 2 (del): Bad file descriptor
> > >
> > > > Somewhere there should be some documentation how one runs an mpi
> > > > job on your system.
> > >
> > > I only found this:
> > >
> > > Before submitting a job, it should be wrapped in an appropriate
> > > script understandable by the queue system, e.g.:
> > >
> > > /home/users/user/submit_script.sl
> > >
> > > Sample SLURM script:
> > >
> > > #!/bin/bash -l
> > > #SBATCH -N 1
> > > #SBATCH --mem 5000
> > > #SBATCH --time=20:00:00
> > > /sciezka/do/pliku/binarnego/plik_binarny.in > /sciezka/do/pliku/wyjsciowego.out
> > >
> > > To submit a job to a specific queue, use the #SBATCH -p
> > > parameter, e.g.:
> > >
> > > #!/bin/bash -l
> > > #SBATCH -N 1
> > > #SBATCH --mem 5000
> > > #SBATCH --time=20:00:00
> > > #SBATCH -p standard
> > > /sciezka/do/pliku/binarnego/plik_binarny.in > /sciezka/do/pliku/wyjsciowego.out
> > >
> > > The task must then be submitted using the *sbatch* command:
> > >
> > > sbatch /home/users/user/submit_script.sl
> > >
> > > *Submitting interactive tasks*
> > >
> > > Interactive tasks can be divided into two groups:
> > >
> > > · interactive task (working in text mode)
> > > · interactive task
> > >
> > > *Interactive task (working in text mode)*
> > >
> > > Submitting an interactive task is very simple; in the simplest case it
> > > comes down to issuing the command below:
> > >
> > > srun --pty /bin/bash
> > >
> > > Sincerely yours,
> > >
> > > Leila Mollabashi
> > >
> > >
> > > On Wed, Apr 14, 2021 at 12:03 AM leila mollabashi
> > > <le.mollabashi at gmail.com> wrote:
> > >
> > > Dear Prof. Peter Blaha and WIEN2k users,
> > >
> > > Thank you for your assistances.
> > >
> > > > At least now the error: "lapw0 not found" is gone. Do you
> > > understand why ??
> > >
> > > Yes, I think it is because now the path is clearly known.
> > >
> > > >How many slots do you get by this srun command ?
> > >
> > > Usually I was assigned a node with 28 CPUs.
> > >
> > > >Is this the node with the name e0591 ???
> > >
> > > Yes, it is.
> > >
> > > >Of course the .machines file must be consistent (dynamically
> > adapted)
> > >
> > > with the actual nodename.
> > >
> > > Yes, to do this I use my script.
> > >
> > > When I use "srun --pty -n 8 /bin/bash", which goes to a
> > > node with 8 free cores, and run x lapw0 -p, then this happens:
> > >
> > > starting parallel lapw0 at Tue Apr 13 20:50:49 CEST 2021
> > >
> > > -------- .machine0 : 4 processors
> > >
> > > [1] 12852
> > >
> > > [e0467:12859] mca_base_component_repository_open: unable to
> open
> > > mca_btl_uct: libucp.so.0: cannot open shared object file: No
> such
> > > file or directory (ignored)
> > >
> > >
> [e0467][[56319,1],1][btl_openib_component.c:1699:init_one_device]
> > > error obtaining device attributes for mlx4_0 errno says
> > Protocol not
> > > supported
> > >
> > > [e0467:12859] mca_base_component_repository_open: unable to
> open
> > > mca_pml_ucx: libucp.so.0: cannot open shared object file: No
> such
> > > file or directory (ignored)
> > >
> > > LAPW0 END
> > >
> > > [1] Done mpirun -np 4 -machinefile
> > > .machine0 /home/users/mollabashi/v19.2/lapw0_mpi lapw0.def >>
> > .time00
> > >
> > > Sincerely yours,
> > >
> > > Leila Mollabashi
> > >
> > >
> > > _______________________________________________
> > > Wien mailing list
> > > Wien at zeus.theochem.tuwien.ac.at
> > > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > > SEARCH the MAILING-LIST at:
> > > http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> >
> > --
> > --------------------------------------------------------------------------
> > Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> > Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
> > Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
> > WWW: http://www.imc.tuwien.ac.at
> > -------------------------------------------------------------------------
> >
> >
>
>