[Wien] MPI stuck at lapw0

Peter Blaha pblaha at theochem.tuwien.ac.at
Tue Oct 17 17:46:03 CEST 2017


Did you try setting MPI_REMOTE to 0 in parallel_options?

----------------------------

Furthermore, your .machines file is not ok for lapw1: the "speed:" prefix 
is missing at the beginning of each line. With it, the lines would read:
1:n05-32:10
1:n05-38:10

Actually, even with this you are still NOT running lapw1 in mpi-mode 
across multiple nodes; you are running 2 independent mpi-jobs for lapw1, 
one on n05-32 and the other on n05-38.

Does lapw1 work with just one line, so that a single mpi-job spans both nodes?

1:n05-32:10 n05-38:10
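
To avoid typing such lines by hand, a small helper can write them from a 
node list. This is only a sketch: the function name build_machines and its 
arguments are made up for illustration, and the line formats are simply 
the ones appearing in this thread.

```python
def build_machines(hosts, cores_per_host, speed=1):
    """Return .machines text with ONE mpi lapw1 job spanning all hosts.

    Hypothetical helper; the line formats are copied from this thread.
    """
    # "host:cores" entries separated by blanks, e.g. "n05-32:10 n05-38:10"
    entries = " ".join(f"{host}:{cores_per_host}" for host in hosts)
    lines = [
        f"{speed}:{entries}",  # a single line -> a single mpi-job over all hosts
        f"lapw0:{entries}",    # lapw0 line as in the original .machines file
        "granularity=1",
        "extrafine=1",
    ]
    return "\n".join(lines) + "\n"


print(build_machines(["n05-32", "n05-38"], 10), end="")
```

Writing one "speed:" line per host instead would give back the two 
independent mpi-jobs described above.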

--------------------
Not related to the problems, but for performance:

10 cores (a 2x5 grid, columns x rows) for one parallel lapw1mpi is usually 
not a good idea. At least "textbook wisdom" says the distribution should 
be as close to square as possible.
It needs testing, but my guess is that using 9 cores (3x3) is even faster.
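
To see which core counts give a near-square grid, one can list the most 
square factorization for each count. The sketch below is plain arithmetic 
and makes no claim about how lapw1mpi actually chooses its grid; the 
function name most_square_grid is made up for illustration.

```python
import math


def most_square_grid(ncores):
    """Return (rows, cols) with rows * cols == ncores and rows <= cols,
    choosing rows as close to sqrt(ncores) as possible (ncores >= 1)."""
    rows = math.isqrt(ncores)
    # Step down from floor(sqrt(n)) until we reach a divisor of ncores.
    while ncores % rows != 0:
        rows -= 1
    return rows, ncores // rows


for n in (8, 9, 10, 16):
    rows, cols = most_square_grid(n)
    print(f"{n} cores -> {rows}x{cols} grid, aspect ratio {cols / rows:.2f}")
```

For 10 cores this gives the 2x5 grid mentioned above, for 9 cores the 
3x3 one. Whether 9 cores really beat 10 still has to be timed.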



On 10/17/2017 05:00 PM, Luigi Maduro - TNW wrote:
> Dear WIEN2k users,
>
>
>
> I have the following problem. I am trying to do parallel computing on a
> cluster. Whenever I run a job on a single node, both the MPI and the
> k-point parallelization work fine. However, when I go to several nodes,
> the job does not do anything. The script just gets stuck on lapw0
> whenever MPI is used; the k-point parallelization gives no problem when
> running on multiple nodes. Additionally, if I instead run a job where
> lapw0 runs in parallel on multiple processors of only one node, but lapw1
> and lapw2 run on multiple nodes, then again there is no problem.
>
> I do not get an error while running lapw0 in parallel over multiple
> nodes: the job simply doesn't do anything. The assigned nodes are
> allocated to the job, but the load on the nodes stays at 0%. When I
> forcibly stop the job, I get the following errors:
>
>
>
> [mpiexec@n05-38] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
> [mpiexec@n05-38] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
> [mpiexec@n05-38] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
> [mpiexec@n05-38] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
> [mpiexec@n05-38] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
> [mpiexec@n05-38] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion
>
>
> In the example above I tried to run a job on nodes n05-32 and n05-38.
>
> The operating system on the cluster is CentOS 7
> <http://www.centos.org/>. The cluster consists of a master node where
> Maui and Torque are running (PBS implementation). The cluster is set up
> so that rsh can be used instead of ssh.
>
> I am using Intel's Parallel Studio 2016.
>
>
>
> My .machines file looked like this:
>
> n05-32:10
> n05-38:10
> lapw0:n05-32:10 n05-38:10
> extrafine=1
> granularity=1
>
>
>
> In the script I submit for the jobs I source my .bashrc file:
> source /home/.bashrc
>
>
>
> The .bashrc file has the following lines:
>
> source /opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mkl/bin/mklvars.sh intel64
> source /opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/bin/compilervars.sh intel64
> source /opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/bin/iccvars.sh intel64
> source /opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mpi/intel64/bin/mpivars.sh intel64
>
> export PATH=$PATH:/opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mpi/intel64/bin
> export PATH=$PATH:/opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mkl/include
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mpi/intel64/lib
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mkl/lib/intel64
>
>
>
>
>
> Along with the rest of the aliases and environment variables that
> ./userconfig_lapw sets up.
>
> Some more parameters:
> System: LI
>
> ifort compiler
>
> icc c compiler
>
> mpiifort compiler
>
>
> Compiler options: -O1 -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
> -traceback -assume buffered_io
> -I/opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mkl/include
>
> Linker flags: $(FOPT)
> -L/opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mkl/lib/intel64
> -lpthread
>
> Preprocessor flags: '-DParallel'
>
> R_LIBS: -lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread
> -lmkl_core -openmp -lpthread -liomp5
>
> SCALAPACK:
> -L/opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mkl/lib/intel64
> -lmkl_scalapack_lp64
> -L/opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mkl/lib/intel64
> -lmkl_blacs_intelmpi_lp64
>
>
>
>
>
> And my parallel_options file looks like this:
>
> setenv TASKSET "no"
> if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
> if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 1
> setenv WIEN_GRANULARITY 1
> setenv DELAY 0.1
> setenv SLEEPY 1
> setenv WIEN_MPIRUN "/opt/ud/intel_xe_2016/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/mpi/intel64/bin/mpirun -ppn $PBS_NUM_PPN -np _NP_ -machinefile _HOSTS_ _EXEC_"
> setenv CORES_PER_NODE 20
>
>
>
> I downloaded my fftw package and configured it with the options
> F77=ifort CC=icc MPICC=mpiicc
>
> Similarly for the LIBXC package: FC=ifort CC=icc
>
>
>
> Any help is appreciated.
> Luigi Maduro

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------

