[Wien] some comments on parallel execution of wien2k
Duy Le
ttduyle at gmail.com
Wed Dec 23 20:17:14 CET 2009
Thank you. Those are interesting findings, especially the first one. I
haven't dug that deep into the code after finding the alternative way.
Merry Xmas.
--------------------------------------------------
Duy Le
PhD Student
Department of Physics
University of Central Florida.
"Men don't need hand to do things"
On Wed, Dec 23, 2009 at 7:08 AM, Sergiu Arapan <sergiu.arapan at gmail.com>wrote:
> Dear wien2k users and developers,
>
> I would like to post a few comments on running the parallel version of
> wien2k on a distributed-memory cluster. I'm using the most recent version
> of wien2k (09.2) on a Linux-based cluster with 805 HP ProLiant DL140 G3
> nodes, each node consisting of an Intel Xeon E5345 Quad Core Processor
> (2.33 GHz, 4 MB Level 2 cache), interconnected by a next-generation
> InfiniBand interconnect. The operating system is CentOS 5 64-bit Linux and
> the resource manager is SLURM. I compiled the source code with Intel
> compilers (ifort 10.1.017) and Intel-built OpenMPI (mpif90 1.2.7), and
> linked against MKL (10.0.1.014), FFTW (2.1.5) and the corresponding
> OpenMPI libs.
>
> My first comment concerns the implementation of MPI fine-grained
> parallelization. Within the current version of wien2k, the module
> lapw2_mpi crashes if N_noneq_atoms (the number of nonequivalent atoms in
> the case.struct file) is not a multiple of N_cpus (the number of
> processors running lapw2_mpi). This strange behavior was reported in a
> recent post by Duy Le with the subject “[Wien] MPI problem for LAPW2” (
> http://zeus.theochem.tuwien.ac.at/pipermail/wien/2009-September/012042.html).
> He noticed that for a system consisting of 21 (nonequivalent) atoms the
> program runs only on 3 or 7 cpus. He managed to cure the problem by
> setting lapw2_vector_split:$N_cpus, but without a reasonable explanation.
> However, one can get a hint by looking at the lapw2 source files and the
> output of lapw2_mpi. Let's consider, for example, the cd16te15sb.struct file from
> $WIENROOT/example_struct_files, which describes a structure with 5
> nonequivalent atoms. Let's run it on a compute node with 8 cpus, using the
> following .machines file:
> granularity:1
> 1:n246:8
> lapw0:n246:8
> extrafine:1
>
> Here are some lines from the resulting case.dayfile:
> > lapw0 -p (19:41:09) starting parallel lapw0 at Tue Dec 22 19:41:09 CET
> 2009
> -------- .machine0 : 8 processors
> mpirun --verbose -np 8 --hostfile .machine0 $WIENROOT/lapw0_mpi lapw0.def
> Tue Dec 22 19:41:27 CET 2009 -> all processes done.
> …..................................................................................
>
> > lapw1 -c -p (19:41:28) starting parallel lapw1 at Tue Dec 22 19:41:28 CET
> 2009
> -> starting parallel LAPW1 jobs at Tue Dec 22 19:41:28 CET 2009
> 1 number_of_parallel_jobs
> -------- .machine1 : 8 processors : weight 1
> mpirun --verbose -np 8 --hostfile .machine1 $WIENROOT/lapw1c_mpi
> lapw1_1.def
> waiting for all processes to complete
> Tue Dec 22 19:48:26 CET 2009 -> all processes done.
> ….....................................................................................
>
> > lapw2 -c -p (19:48:28) running LAPW2 in parallel mode
> running parallel lapw2
> mpirun --verbose -np 8 --hostfile .machine1 $WIENROOT/lapw2c_mpi
> lapw2_1.def 1
> sleeping for 1 seconds
> waiting for processes:
> ** LAPW2 crashed!
> ….....................................................................................
>
>
> The job crashed with the following error message:
> [n246:15992] *** An error occurred in MPI_Comm_split
> [n246:15992] *** on communicator MPI_COMM_WORLD
> [n246:15992] *** MPI_ERR_ARG: invalid argument of some other kind
> [n246:15992] *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> Now, if one takes a look at case.output2_1_proc_n (n=1,2,...,7), one sees
> the following header (here for case.output2_1_proc_1):
>
> init_parallel_2 1 8 1 8 2
> MPI run on 8 processors in MPI_COMM_WORLD
> 8 processors in MPI_vec_COMM (atoms splitting)
> 1 processors in MPI_atoms_COMM (vector splitting)
>
> myid= 1
> myid_atm= 1
> myid_vec= 1
>
> time in recpr: 0.820000000000000
>
> One can find the following lines in the lapw2.F source file (lines
> 129-137):
> #ifdef Parallel
> write(6,*) 'MPI run on ',npe,' processors in MPI_COMM_WORLD'
> write(6,*) ' ',npe_atm,' processors in MPI_vec_COMM (atoms splitting)'
> write(6,*) ' ',npe_vec,' processors in MPI_atoms_COMM (vector splitting)'
> write(6,*) ' myid= ',myid
> write(6,*) ' myid_atm= ',myid
> write(6,*) ' myid_vec= ',myid
> write(6,*) ' '
> #endif
> which generate this output. (Note that all three 'myid' lines print the
> same variable myid, so the printed myid_atm and myid_vec do not show the
> actual sub-communicator ranks.)
>
> If I understand correctly, npe is the total number of cpus, npe_atm is the
> number of cpus for the parallelization over atoms, and npe_vec is the
> number of cpus for the additional parallelization of the density over
> vectors (I think the labels MPI_vec_COMM and MPI_atoms_COMM in this output
> should be swapped).
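> To make the meaning of these numbers more concrete, here is a minimal,
> self-contained sketch of how such a two-level split of MPI_COMM_WORLD is
> typically done. This is my own illustration, not the actual WIEN2k code,
> and the value of npe_vec is hypothetical:
>
>       program comm_split_sketch
>       ! generic two-level decomposition: npe = npe_atm * npe_vec
>       implicit none
>       include 'mpif.h'
>       integer :: ierr, npe, myid, npe_vec, npe_atm
>       integer :: myid_atm, myid_vec, comm_atm, comm_vec
>       call MPI_Init(ierr)
>       call MPI_Comm_size(MPI_COMM_WORLD, npe, ierr)
>       call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
>       npe_vec = 1                  ! hypothetical lapw2_vector_split value
>       npe_atm = npe / npe_vec      ! integer division, as in init_parallel_2
>       ! ranks with the same color end up in the same sub-communicator
>       call MPI_Comm_split(MPI_COMM_WORLD, myid/npe_atm, myid, comm_atm, ierr)
>       call MPI_Comm_split(MPI_COMM_WORLD, mod(myid,npe_atm), myid, comm_vec, ierr)
>       call MPI_Comm_rank(comm_atm, myid_atm, ierr)
>       call MPI_Comm_rank(comm_vec, myid_vec, ierr)
>       write(6,*) 'myid=', myid, ' myid_atm=', myid_atm, ' myid_vec=', myid_vec
>       call MPI_Finalize(ierr)
>       end program comm_split_sketch
>
> With npe=8 and npe_vec=1 this gives myid_atm=0..7 and myid_vec=0 on every
> process, i.e. splitting over atoms only.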
>
> One can also find the following lines (306-311) in the l2main.F file:
> ! ---------------------------------
> ! START LOOP FOR ALL ATOMS
> ! ---------------------------------
>
> non_equiv_loop: do jatom_pe=1,nat,npe_atm
> jatom=jatom_pe+myid_atm
>
> from which I understand that the loop runs over the nat nonequivalent
> atoms with step npe_atm.
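> To see what this indexing does in the present example (nat=5), here is a
> small serial toy program of mine that merely mimics the arithmetic of the
> loop above, assuming myid_atm runs from 0 to npe_atm-1 as MPI ranks
> usually do (it is not WIEN2k code):
>
>       program atom_loop_sketch
>       implicit none
>       integer :: nat, npe_atm, myid_atm, jatom_pe, jatom
>       nat = 5                           ! nonequivalent atoms in Cd16Te15Sb
>       npe_atm = 8                       ! 8 cpus, no vector splitting
>       do myid_atm = 0, npe_atm-1
>          do jatom_pe = 1, nat, npe_atm  ! executes once, jatom_pe = 1
>             jatom = jatom_pe + myid_atm ! ranks 5,6,7 get jatom = 6,7,8 > nat
>             write(6,*) 'rank', myid_atm, ' -> jatom', jatom
>          end do
>       end do
>       end program atom_loop_sketch
>
> The ranks with myid_atm=5,6,7 end up with jatom larger than nat, i.e. they
> point past the last nonequivalent atom, whereas with npe_atm=5 (or with 21
> atoms on 3 or 7 cpus) the indices cover the atoms exactly. Whether this is
> what actually triggers the MPI_Comm_split error I cannot say, but the
> arithmetic at least shows why the atom count matters.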
>
> Now let's make some changes in lapw2para to run lapw2_mpi on 5 cpus, and
> take a look at the case.dayfile and case.output2_1_proc_1 files.
> Here are lines from case.dayfile:
> > lapw0 -p (20:08:14) starting parallel lapw0 at Tue Dec 22 20:08:14 CET
> 2009
> Tue Dec 22 20:08:14 CET 2009 -> Setting up case Cd16Te15Sb for parallel
> execution
> -------- .machine0 : 8 processors
> mpirun --verbose -np 8 --hostfile .machine0 $WIENROOT/lapw0_mpi lapw0.def
> Tue Dec 22 20:08:33 CET 2009 -> all processes done.
> ….............................................................................
>
> > lapw1 -c -p (20:08:34) starting parallel lapw1 at Tue Dec 22 20:08:34 CET
> 2009
> -> starting parallel LAPW1 jobs at Tue Dec 22 20:08:34 CET 2009
> mpirun --verbose -np 8 --hostfile .machine1 $WIENROOT/lapw1c_mpi
> lapw1_1.def
> waiting for all processes to complete
> Tue Dec 22 20:15:32 CET 2009 -> all processes done.
> …...................................................................................................
>
> > lapw2 -c -p (20:15:33) running LAPW2 in parallel mode
> machines: n383
> running parallel lapw2
> mpirun --verbose -np 5 --hostfile .machine1 $WIENROOT/lapw2c_mpi
> lapw2_1.def 1
> sleeping for 1 seconds
> waiting for processes:
> n383 0.014u 0.008s 0:51.32 0.0% 0+0k 0+0io 0pf+0w
> …................................................
> :ENERGY convergence: 0 0 .0004326450000000
> :CHARGE convergence: 1 0.001 -.0000259
> ec cc and fc_conv 1 1 1
> > stop
>
> This time job terminated successfully, and first lines of
> case.output2_1_proc_1 read as:
> init_parallel_2 1 5 1 5 2
> MPI run on 5 processors in MPI_COMM_WORLD
> 5 processors in MPI_vec_COMM (atoms splitting)
> 1 processors in MPI_atoms_COMM (vector splitting)
>
> myid= 1
> myid_atm= 1
> myid_vec= 1
>
> time in recpr: 0.820000000000000
>
> :POS002: AT.NR. -2 POSITION = 0.12426 0.12426 0.12426 MULTIPLICITY = 4
> …....................................................................................................................................
>
>
> Now, let's see what happens when lapw2_vector_split:N_cpus is set in the
> .machines file:
> granularity:1
> 1:n21:8
> lapw2_vector_split:8
> lapw0:n21:8
> extrafine:1
>
> One can read the following lines from case.dayfile:
> > lapw2 -c -p (21:06:51) running LAPW2 in parallel mode
> machines: n21
> running parallel lapw2
> mpirun --verbose -np 8 --hostfile .machine1
> /home/x_serar/wien2k/09.2/openmpi/lapw2c_mpi lapw2_1.def 1
> sleeping for 1 seconds
> waiting for processes:
> n21 0.027u 0.012s 1:26.55 0.0% 0+0k 0+0io 0pf+0w
> ….......................................................................................................
>
> :ENERGY convergence: 0 0 .0000601700000000
> :CHARGE convergence: 1 0.001 -.0006113
> ec cc and fc_conv 1 1 1
> > stop
>
> The first lines of case.output2_1_proc_1 read as:
> init_parallel_2 1 8 8 1 1
> MPI run on 8 processors in MPI_COMM_WORLD
> 1 processors in MPI_vec_COMM (atoms splitting)
> 8 processors in MPI_atoms_COMM (vector splitting)
>
> myid= 1
> myid_atm= 1
> myid_vec= 1
>
> time in recpr: 0.810000000000000
> 0 0.191606E+00 -0.475342E+00 -0.217499E+00 -0.295484E+00 0.795881E-01 4 4 4
> ….......................................................................................................................................................
>
>
> That is, there is no splitting over atoms, and npe_atm=1 trivially divides
> the number of nonequivalent atoms. This result, npe_atm=1, becomes clear
> if one takes a look at modules.F, SUBROUTINE init_parallel_2 (line 78):
> ...................................................................................
>
> npe_atm=npe/npe_vec
> …...............................................................................
>
>
> Thus, with npe=8 and npe_vec=8 the integer division gives npe_atm=1, there
> is effectively no parallelization over atoms, and the problem disappears.
> The crash of lapw2_mpi is therefore not related to memory issues, but to
> the way the parallelization over atoms is implemented. My analysis is, of
> course, superficial, and I cannot say whether there is a bug in the
> lapw2_mpi module, but I think this issue deserves some attention from the
> developers.
>
> My second comment is that you do not need to connect through ssh to the
> allocated processors on different compute nodes in order to run lapw1(c)
> or lapw2(c) (the case of parallelization over k-points). You can launch
> these parallel processes by invoking mpirun instead.
> First, set "setenv WIEN_MPIRUN 'mpirun -np _NP_ --hostfile _HOSTS_
> _EXEC_'" in $WIENROOT/parallel_options.
> Second, instead of line “(cd $PWD;$t $exe ${def}_$loop.def;rm -f
> .lock_$lockfile[$p]) >>.time1_$loop &” in lapw1para (line 406) use the
> following two lines:
> “set ttt=(`echo $mpirun | sed -e "s^_NP_^$number_per_job[$p]^" -e
> "s^_HOSTS_^.machine$p^" -e "s^_EXEC_^$WIENROOT/${exe} ${def}_$loop.def^"`)”
> and
> “(cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]) >>.time1_$loop &”
> similar to the mpi execution above.
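> For illustration, with hypothetical values (job p=1 running on 1 processor,
> complex case), the sed substitution turns the template into something like
> mpirun -np 1 --hostfile .machine1 $WIENROOT/lapw1c lapw1_1.def
> which is then executed in the background exactly as the ssh line was.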
> In the same fashion, in lapw2para instead of line 314 “(cd $PWD;$t $exe
> ${def}_${loop}.def $loop;rm -f .lock_$lockfile[$p]) >>.time2_$loop &” use
> the following 2 lines:
> “set ttt=(`echo $mpirun | sed -e "s^_NP_^$number_per_job2[$loop]^" -e
> "s^_HOSTS_^.machine$mach[$loop]^" -e "s^_EXEC_^$WIENROOT/${exe}
> ${def}_$loop.def $loop^"`)”
> and
> “(cd $PWD;$t $ttt $vector_split;rm -f .lock_$lockfile[$p]) >>.time2_$loop
> &”.
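> Again for illustration with hypothetical values (loop 1, one processor,
> complex case, lapw2_vector_split:1), the resulting call looks like
> mpirun -np 1 --hostfile .machine1 $WIENROOT/lapw2c lapw2_1.def 1 1
> where the last two arguments are the k-point loop index and $vector_split.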
>
> I hope you will find these comments useful :) .
>
> Regards and Merry Christmas,
> Sergiu Arapan
>