[Wien] MPI error
Peter Blaha
pblaha at theochem.tuwien.ac.at
Wed Apr 14 08:04:47 CEST 2021
It cannot initialize an MPI job, because the interface software is
missing.
You need to ask the computing center / system administrators how one
executes an MPI job on this computer.
It could be that "mpirun" is not supported on this machine. You may try
a WIEN2k installation with system "LS" in siteconfig. This will
configure the parallel environment/commands using slurm commands like
srun -K -N_nodes_ -n_NP_ ..., replacing mpirun.
We used it once on our HPC machine, since it was recommended by the
computing center people. However, it turned out that the standard mpirun
installation was more stable, because the slurm controller died too
often, leading to many random crashes. Anyway, if your system has what
is called "tight integration of MPI", it might be necessary.
On 13.04.2021 at 21:47, leila mollabashi wrote:
> Dear Prof. Peter Blaha and WIEN2k users,
>
> Then, by running x lapw1 -p:
>
> starting parallel lapw1 at Tue Apr 13 21:04:15 CEST 2021
>
> -> starting parallel LAPW1 jobs at Tue Apr 13 21:04:15 CEST 2021
>
> running LAPW1 in parallel mode (using .machines)
>
> 2 number_of_parallel_jobs
>
> [1] 14530
>
> [e0467:14538] mca_base_component_repository_open: unable to open
> mca_btl_uct: libucp.so.0: cannot open shared object file: No such file
> or directory (ignored)
>
> WARNING: There was an error initializing an OpenFabrics device.
>
> Local host: e0467
>
> Local device: mlx4_0
>
> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
>
> with errorcode 0.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>
> You may or may not see output from other processes, depending on
>
> exactly when Open MPI kills them.
>
> --------------------------------------------------------------------------
>
> [e0467:14567] 1 more process has sent help message
> help-mpi-btl-openib.txt / error in device init
>
> [e0467:14567] 1 more process has sent help message
> help-mpi-btl-openib.txt / error in device init
>
> [e0467:14567] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
> [warn] Epoll MOD(1) on fd 27 failed. Old events were 6; read change was
> 0 (none); write change was 2 (del): Bad file descriptor
>
> > Somewhere there should be some documentation on how one runs an MPI job
> > on your system.
>
> I found only this:
>
> Before submitting a job, it must be wrapped in an appropriate
> script understandable to the queue system, e.g.:
>
> /home/users/user/submit_script.sl
>
> Sample SLURM script:
>
> #!/bin/bash -l
>
> #SBATCH -N 1
>
> #SBATCH --mem 5000
>
> #SBATCH --time=20:00:00
>
> /path/to/binary/binary_file.in > /path/to/output_file.out
>
> To submit a job to a specific queue, use the #SBATCH -p parameter, e.g.:
>
> #!/bin/bash -l
>
> #SBATCH -N 1
>
> #SBATCH --mem 5000
>
> #SBATCH --time=20:00:00
>
> #SBATCH -p standard
>
> /path/to/binary/binary_file.in > /path/to/output_file.out
>
> The job is then submitted using the *sbatch* command:
>
> sbatch /home/users/user/submit_script.sl
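>
> For a WIEN2k calculation the same pattern would apply; a rough sketch of
> such a script (the resource numbers are examples only, and the .machines
> part assumes plain k-point parallelism):
>
> #!/bin/bash -l
> #SBATCH -N 1
> #SBATCH -n 8
> #SBATCH --mem 5000
> #SBATCH --time=20:00:00
>
> # write one "1:hostname" line per allocated node, i.e. one k-point
> # parallel job per node (the full .machines syntax is described in
> # the WIEN2k users guide)
> rm -f .machines
> for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
>     echo "1:$host" >> .machines
> done
> echo "granularity:1" >> .machines
>
> run_lapw -p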
>
> *Submitting interactive tasks*
>
>
> Interactive tasks can be divided into two groups:
>
> · interactive task (working in text mode)
>
> · interactive task
>
> *Interactive task (working in text mode)*
>
>
> Submitting an interactive task is very simple; in the simplest case it
> comes down to issuing the command below:
>
> srun --pty /bin/bash
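>
> If more cores are needed, the usual resource options can be added to the
> same command, e.g. (the numbers are examples only):
>
> srun -N 1 -n 8 --time=02:00:00 --pty /bin/bash -l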
>
> Sincerely yours,
>
> Leila Mollabashi
>
>
> On Wed, Apr 14, 2021 at 12:03 AM leila mollabashi
> <le.mollabashi at gmail.com> wrote:
>
> Dear Prof. Peter Blaha and WIEN2k users,
>
> Thank you for your assistance.
>
> > At least now the error: "lapw0 not found" is gone. Do you
> understand why ??
>
> Yes, I think it is because now the path is clearly known.
>
> >How many slots do you get by this srun command ?
>
> Usually I get a node with 28 CPUs.
>
> >Is this the node with the name e0591 ???
>
> Yes, it is.
>
> > Of course the .machines file must be consistent (dynamically adapted)
> > with the actual nodename.
>
> Yes, to do this I use my script.
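>
> In the simplest one-node case such a script can be only a few lines, e.g.
> (a sketch; the :4 counts assume 4-core MPI jobs, as in the lapw0 run
> below, and it must run inside the allocation so that SLURM_JOB_NODELIST
> is set):
>
> host=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
> cat > .machines <<EOF
> lapw0: $host:4
> 1:$host:4
> granularity:1
> EOF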
>
> When I use "srun --pty -n 8 /bin/bash", it goes to a node with 8 free
> cores; when I then run x lapw0 -p, this happens:
>
> starting parallel lapw0 at Tue Apr 13 20:50:49 CEST 2021
>
> -------- .machine0 : 4 processors
>
> [1] 12852
>
> [e0467:12859] mca_base_component_repository_open: unable to open
> mca_btl_uct: libucp.so.0: cannot open shared object file: No such
> file or directory (ignored)
>
> [e0467][[56319,1],1][btl_openib_component.c:1699:init_one_device]
> error obtaining device attributes for mlx4_0 errno says Protocol not
> supported
>
> [e0467:12859] mca_base_component_repository_open: unable to open
> mca_pml_ucx: libucp.so.0: cannot open shared object file: No such
> file or directory (ignored)
>
> LAPW0 END
>
> [1] Done mpirun -np 4 -machinefile
> .machine0 /home/users/mollabashi/v19.2/lapw0_mpi lapw0.def >> .time00
>
> Sincerely yours,
>
> Leila Mollabashi
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at
-------------------------------------------------------------------------