[Wien] MPI error

leila mollabashi le.mollabashi at gmail.com
Fri Apr 23 16:04:46 CEST 2021


Dear Prof. Peter Blaha and WIEN2k users,

Thank you for your assistance.

Here is the admin's reply:

   - the mpirun/mpiexec command is supported after loading the proper module (I
   suggest openmpi/4.1.0 with gcc 6.2.0 or icc)
   - you have to describe the needed resources (I suggest --nodes and
   --ntasks-per-node; please use a "whole node", so ntasks-per-node = 28, 32,
   or 48, depending on the partition)
   - yes, our cluster has "tight integration with MPI", but the other way
   around: our MPI libraries are compiled with SLURM support, so when you
   describe the resources at the beginning of the batch script, you do not
   have to use the "-np" and "-machinefile" options for mpirun/mpiexec


   - this error message ("btl_openib_component.c:1699:init_one_device") is
   caused by an "old" MPI library, so please recompile your application
   (WIEN2k) using openmpi/4.1.0_icc19
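
For reference, a minimal SLURM batch script following these recommendations
could look like the sketch below. The module name openmpi/4.1.0_icc19 and the
"whole node" request come from the admin's reply; the partition name
"standard", the 28-core node, and the 20-hour limit are assumptions:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=28      # whole node, as the admin suggests
#SBATCH --time=20:00:00
#SBATCH -p standard               # partition name is an assumption

module load openmpi/4.1.0_icc19   # module named by the admin

# SLURM-aware Open MPI picks up the allocated resources itself,
# so mpirun/mpiexec needs no -np or -machinefile options:
mpirun ./my_mpi_program           # placeholder executable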

Now should I compile WIEN2k with SL or LI?

Sincerely yours,

Leila Mollabashi

On Wed, Apr 14, 2021 at 10:34 AM Peter Blaha <pblaha at theochem.tuwien.ac.at>
wrote:

> It cannot initialize an mpi job, because it is missing the interface
> software.
>
> You need to ask the computing center / system administrators how one
> executes an MPI job on this computer.
>
> It could be that "mpirun" is not supported on this machine. You may try
> a wien2k installation with system "LS" in siteconfig. This will
> configure the parallel environment/commands using "slurm" commands like
> srun -K -N_nodes_ -n_NP_  ..., replacing mpirun.
> We used it once on our hpc machine, since it was recommended by the
> computing center people. However, it turned out that the standard mpirun
> installation was more stable because the "slurm controller" died too
> often leading to many random crashes. Anyway, if your system has what is
> called "tight integration of mpi", it might be necessary.
>
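> For illustration, here is a minimal sketch of what this amounts to in
> $WIENROOT/parallel_options (the mpirun line is the usual default; the exact
> srun line written by siteconfig may differ):
>
>    # default mpirun-based setting:
>    setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
>    # srun-based "slurm" setting, roughly:
>    setenv WIEN_MPIRUN "srun -K -N_nodes_ -n_NP_ _EXEC_"
>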
> Am 13.04.2021 um 21:47 schrieb leila mollabashi:
> > Dear Prof. Peter Blaha and WIEN2k users,
> >
> > Then, running x lapw1 -p:
> >
> > starting parallel lapw1 at Tue Apr 13 21:04:15 CEST 2021
> >
> > ->  starting parallel LAPW1 jobs at Tue Apr 13 21:04:15 CEST 2021
> >
> > running LAPW1 in parallel mode (using .machines)
> >
> > 2 number_of_parallel_jobs
> >
> > [1] 14530
> >
> > [e0467:14538] mca_base_component_repository_open: unable to open
> > mca_btl_uct: libucp.so.0: cannot open shared object file: No such file
> > or directory (ignored)
> >
> > WARNING: There was an error initializing an OpenFabrics device.
> >
> >    Local host:   e0467
> >
> >    Local device: mlx4_0
> >
> > MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> >
> > with errorcode 0.
> >
> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> >
> > You may or may not see output from other processes, depending on
> >
> > exactly when Open MPI kills them.
> >
> >
> --------------------------------------------------------------------------
> >
> > [e0467:14567] 1 more process has sent help message
> > help-mpi-btl-openib.txt / error in device init
> >
> > [e0467:14567] 1 more process has sent help message
> > help-mpi-btl-openib.txt / error in device init
> >
> > [e0467:14567] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> > all help / error messages
> >
> > [warn] Epoll MOD(1) on fd 27 failed.  Old events were 6; read change was
> > 0 (none); write change was 2 (del): Bad file descriptor
> >
> >>Somewhere there should be some documentation how one runs an mpi job on
> > your system.
> >
> > I only found this:
> >
> > Before submitting a job, it should be wrapped in a script understandable
> > to the queue system, e.g.:
> >
> > /home/users/user/submit_script.sl
> >
> > Sample SLURM script:
> >
> > #!/bin/bash -l
> > #SBATCH -N 1
> > #SBATCH --mem 5000
> > #SBATCH --time=20:00:00
> >
> > /sciezka/do/pliku/binarnego/plik_binarny.in > /sciezka/do/pliku/wyjsciowego.out
> >
> > To submit a job to a specific queue, use the #SBATCH -p parameter, e.g.:
> >
> > #!/bin/bash -l
> > #SBATCH -N 1
> > #SBATCH --mem 5000
> > #SBATCH --time=20:00:00
> > #SBATCH -p standard
> >
> > /sciezka/do/pliku/binarnego/plik_binarny.in > /sciezka/do/pliku/wyjsciowego.out
> >
> > The job must then be submitted using the sbatch command:
> >
> > sbatch /home/users/user/submit_script.sl
> >
> > Submitting interactive tasks
> >
> >
> > Interactive tasks can be divided into two groups:
> >
> > - interactive task (working in text mode)
> > - interactive task
> >
> > Interactive task (working in text mode)
> >
> >
> > Submitting an interactive task is very simple; in the simplest case it
> > comes down to issuing the command below:
> >
> > srun --pty /bin/bash
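> >
> > A whole-node interactive request, following the admin's "whole node"
> > advice, might look like this (the partition name and core count are
> > assumptions):
> >
> > srun -p standard -N 1 --ntasks-per-node=28 --pty /bin/bash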
> >
> > Sincerely yours,
> >
> > Leila Mollabashi
> >
> >
> > On Wed, Apr 14, 2021 at 12:03 AM leila mollabashi
> > <le.mollabashi at gmail.com <mailto:le.mollabashi at gmail.com>> wrote:
> >
> >     Dear Prof. Peter Blaha and WIEN2k users,
> >
> >     Thank you for your assistances.
> >
> >     >  At least now the error: "lapw0 not found" is gone. Do you
> >     understand why ??
> >
> >     Yes, I think it is because the path is now clearly known.
> >
> >     >How many slots do you get by this srun command ?
> >
> >     Usually I get a node with 28 CPUs.
> >
> >     >Is this the node with the name  e0591 ???
> >
> >     Yes, it is.
> >
> >     >Of course the .machines file must be consistent (dynamically
> >     adapted) with the actual nodename.
> >
> >     Yes, to do this I use my script.
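> >
> >     For reference, here is a minimal sketch of such a script, assuming an
> >     MPI-parallel setup with $SLURM_NTASKS_PER_NODE processes on each
> >     allocated node (this layout is only one of several possibilities):
> >
> >     #!/bin/bash
> >     # rebuild .machines from the current SLURM allocation
> >     rm -f .machines
> >     for host in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
> >         echo "1:${host}:${SLURM_NTASKS_PER_NODE}" >> .machines
> >     done
> >     echo "granularity:1" >> .machines
> >     echo "extrafine:1" >> .machines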
> >
> >     When I use “srun --pty -n 8 /bin/bash”, which goes to a node with 8
> >     free cores, and run x lapw0 -p, then this happens:
> >
> >     starting parallel lapw0 at Tue Apr 13 20:50:49 CEST 2021
> >
> >     -------- .machine0 : 4 processors
> >
> >     [1] 12852
> >
> >     [e0467:12859] mca_base_component_repository_open: unable to open
> >     mca_btl_uct: libucp.so.0: cannot open shared object file: No such
> >     file or directory (ignored)
> >
> >     [e0467][[56319,1],1][btl_openib_component.c:1699:init_one_device]
> >     error obtaining device attributes for mlx4_0 errno says Protocol not
> >     supported
> >
> >     [e0467:12859] mca_base_component_repository_open: unable to open
> >     mca_pml_ucx: libucp.so.0: cannot open shared object file: No such
> >     file or directory (ignored)
> >
> >     LAPW0 END
> >
> >     [1]    Done                          mpirun -np 4 -machinefile
> >     .machine0 /home/users/mollabashi/v19.2/lapw0_mpi lapw0.def >> .time00
> >
> >     Sincerely yours,
> >
> >     Leila Mollabashi
> >
> >
>
> --
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
> WWW:   http://www.imc.tuwien.ac.at
> -------------------------------------------------------------------------