[Wien] MPI error

Peter Blaha pblaha at theochem.tuwien.ac.at
Fri Apr 23 16:56:39 CEST 2021


Recompile with LI, since mpirun is supported (after loading the proper MPI module).

PS: Ask them if -np and -machinefile are still possible to use. Otherwise 
you cannot mix k-parallel and mpi-parallel execution, and for smaller 
cases it is certainly a severe limitation to have only ONE mpi job with 
many k-points, a small matrix size and many mpi cores.
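
For illustration, a minimal .machines sketch (hostname and core counts are
hypothetical) mixing both modes: two k-parallel jobs, each MPI-parallel on
14 cores, plus lapw0_mpi on the full 28-core node:

    1:e0467:14
    1:e0467:14
    lapw0:e0467:28
    granularity:1
    extrafine:1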

On 23.04.2021 at 16:04, leila mollabashi wrote:
> Dear Prof. Peter Blaha and WIEN2k users,
> 
> Thank you for your assistance.
> 
> Here is the admin's reply:
> 
>   * mpirun/mpiexec command is supported after loading the proper module (I
>     suggest openmpi/4.1.0 with gcc 6.2.0 or icc)
>   * you have to describe the needed resources (I suggest --nodes and
>     --ntasks-per-node; please use a "whole node", so ntasks-per-node =
>     28 or 32 or 48, depending on the partition; see the sketch after
>     this list)
>   * Yes, our cluster has "tight integration with mpi", but the other way
>     around: our MPI libraries are compiled with SLURM support, so when
>     you describe the resources at the beginning of the batch script, you
>     do not have to use the "-np" and "-machinefile" options for
>     mpirun/mpiexec
> 
>   * the error message "btl_openib_component.c:1699:init_one_device" is
>     caused by an "old" mpi library, so please recompile your application
>     (WIEN2k) using openmpi/4.1.0_icc19
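> 
> Following that advice, a minimal batch-header sketch (the module name and
> core count follow the suggestions above; the partition and case directory
> are hypothetical):
> 
>     #!/bin/bash -l
>     #SBATCH --nodes=1
>     #SBATCH --ntasks-per-node=28
>     #SBATCH -p standard
>     module load openmpi/4.1.0_icc19
>     cd /home/users/mollabashi/case   # hypothetical case directory
>     # SLURM-integrated MPI: the allocation is picked up automatically,
>     # so no -np / -machinefile options are needed:
>     mpirun $WIENROOT/lapw0_mpi lapw0.def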
> 
> Now should I compile WIEN2k with LS or LI?
> 
> Sincerely yours,
> 
> Leila Mollabashi
> 
> 
> On Wed, Apr 14, 2021 at 10:34 AM Peter Blaha 
> <pblaha at theochem.tuwien.ac.at> wrote:
> 
>     It cannot initialize an mpi job, because it is missing the interface
>     software.
> 
>     You need to ask the computing center / system administrators how one
>     executes an mpi job on this computer.
> 
>     It could be that "mpirun" is not supported on this machine. You may try
>     a wien2k installation with system "LS" in siteconfig. This will
>     configure the parallel environment/commands using "slurm" commands like
>     srun -K -N_nodes_ -n_NP_ ..., replacing mpirun.
>     We used it once on our hpc machine, since it was recommended by the
>     computing center people. However, it turned out that the standard mpirun
>     installation was more stable, because the "slurm controller" died too
>     often, leading to many random crashes. Anyway, if your system has what
>     is called "tight integration of mpi", it might be necessary.
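> 
>     As a rough illustration (the exact command is configured by
>     siteconfig, so details may differ), an mpi launch like
> 
>       mpirun -np 4 -machinefile .machine0 $WIENROOT/lapw0_mpi lapw0.def
> 
>     would with "LS" become something like
> 
>       srun -K -N 1 -n 4 $WIENROOT/lapw0_mpi lapw0.def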
> 
>     On 13.04.2021 at 21:47, leila mollabashi wrote:
>      > Dear Prof. Peter Blaha and WIEN2k users,
>      >
>      > Then by running x lapw1 -p:
>      >
>      > starting parallel lapw1 at Tue Apr 13 21:04:15 CEST 2021
>      >
>      > ->  starting parallel LAPW1 jobs at Tue Apr 13 21:04:15 CEST 2021
>      >
>      > running LAPW1 in parallel mode (using .machines)
>      >
>      > 2 number_of_parallel_jobs
>      >
>      > [1] 14530
>      >
>      > [e0467:14538] mca_base_component_repository_open: unable to open
>      > mca_btl_uct: libucp.so.0: cannot open shared object file: No such
>      > file or directory (ignored)
>      >
>      > WARNING: There was an error initializing an OpenFabrics device.
>      >
>      >    Local host:   e0467
>      >
>      >    Local device: mlx4_0
>      >
>      > MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
>      >
>      > with errorcode 0.
>      >
>      > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>      >
>      > You may or may not see output from other processes, depending on
>      >
>      > exactly when Open MPI kills them.
>      >
>      >
>      > --------------------------------------------------------------------------
>      >
>      > [e0467:14567] 1 more process has sent help message
>      > help-mpi-btl-openib.txt / error in device init
>      >
>      > [e0467:14567] 1 more process has sent help message
>      > help-mpi-btl-openib.txt / error in device init
>      >
>      > [e0467:14567] Set MCA parameter "orte_base_help_aggregate" to 0
>      > to see all help / error messages
>      >
>      > [warn] Epoll MOD(1) on fd 27 failed.  Old events were 6; read
>      > change was 0 (none); write change was 2 (del): Bad file descriptor
>      >
>      > > Somewhere there should be some documentation on how one runs an
>      > > mpi job on your system.
>      >
>      > I found only this:
>      >
>      > Before submitting a task, it should be wrapped in an appropriate
>      > script understandable to the queue system, e.g.:
>      >
>      > /home/users/user/submit_script.sl
>      >
>      > Sample SLURM script:
>      >
>      > #!/bin/bash -l
>      > #SBATCH -N 1
>      > #SBATCH --mem=5000
>      > #SBATCH --time=20:00:00
>      > /sciezka/do/pliku/binarnego/plik_binarny.in > /sciezka/do/pliku/wyjsciowego.out
>      >
>      > To submit a task to a specific queue, use the #SBATCH -p
>      > parameter, e.g.:
>      >
>      > #!/bin/bash -l
>      > #SBATCH -N 1
>      > #SBATCH --mem=5000
>      > #SBATCH --time=20:00:00
>      > #SBATCH -p standard
>      > /sciezka/do/pliku/binarnego/plik_binarny.in > /sciezka/do/pliku/wyjsciowego.out
>      >
>      > The task is then submitted using the *sbatch* command:
>      >
>      > sbatch /home/users/user/submit_script.sl
>      >
>      > *Submitting interactive tasks*
>      >
>      >
>      > Interactive tasks can be divided into two groups:
>      >
>      > · interactive task (working in text mode)
>      >
>      > · interactive task
>      >
>      > *Interactive task (working in text mode)*
>      >
>      >
>      > Submitting an interactive task is very simple; in the simplest
>      > case it comes down to issuing the command below.
>      >
>      > srun --pty /bin/bash
>      >
>      > Sincerely yours,
>      >
>      > Leila Mollabashi
>      >
>      >
>      > On Wed, Apr 14, 2021 at 12:03 AM leila mollabashi
>      > <le.mollabashi at gmail.com> wrote:
>      >
>      >     Dear Prof. Peter Blaha and WIEN2k users,
>      >
>      >     Thank you for your assistance.
>      >
>      >     > At least now the error: "lapw0 not found" is gone. Do you
>      >     > understand why??
>      >
>      >     Yes, I think it is because now the path is clearly known.
>      >
>      >     > How many slots do you get by this srun command?
>      >
>      >     Usually I get a node with 28 CPUs.
>      >
>      >     > Is this the node with the name e0591???
>      >
>      >     Yes, it is.
>      >
>      >     > Of course the .machines file must be consistent (dynamically
>      >     > adapted) with the actual nodename.
>      >
>      >     Yes, to do this I use my script.
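>      >
>      >     For example, a minimal sketch of such a script (hypothetical; it
>      >     assumes one k-parallel job per allocated node and the standard
>      >     SLURM environment variables):
>      >
>      >       #!/bin/bash
>      >       # rebuild .machines from the current SLURM allocation
>      >       rm -f .machines
>      >       for host in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
>      >           # one k-parallel job per node, MPI-parallel over its cores
>      >           echo "1:$host:$SLURM_NTASKS_PER_NODE" >> .machines
>      >       done
>      >       echo "lapw0:$(hostname):$SLURM_NTASKS_PER_NODE" >> .machines
>      >       echo "granularity:1" >> .machines
>      >       echo "extrafine:1" >> .machines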
>      >
>      >     When I use “srun --pty -n 8 /bin/bash”, which goes to a node
>      >     with 8 free cores, and run x lapw0 -p, then this happens:
>      >
>      >     starting parallel lapw0 at Tue Apr 13 20:50:49 CEST 2021
>      >
>      >     -------- .machine0 : 4 processors
>      >
>      >     [1] 12852
>      >
>      >     [e0467:12859] mca_base_component_repository_open: unable to open
>      >     mca_btl_uct: libucp.so.0: cannot open shared object file: No such
>      >     file or directory (ignored)
>      >
>      >     [e0467][[56319,1],1][btl_openib_component.c:1699:init_one_device]
>      >     error obtaining device attributes for mlx4_0 errno says
>      >     Protocol not supported
>      >
>      >     [e0467:12859] mca_base_component_repository_open: unable to open
>      >     mca_pml_ucx: libucp.so.0: cannot open shared object file: No such
>      >     file or directory (ignored)
>      >
>      >     LAPW0 END
>      >
>      >     [1]    Done                          mpirun -np 4 -machinefile
>      >     .machine0 /home/users/mollabashi/v19.2/lapw0_mpi lapw0.def >> .time00
>      >
>      >     Sincerely yours,
>      >
>      >     Leila Mollabashi
>      >
>      >

-- 
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at
-------------------------------------------------------------------------

