[Wien] some comments on parallel execution of wien2k

Oleg Rubel rubelo at tbh.net
Mon Dec 28 06:32:23 CET 2009


I tested the cd16te15sb case (1 k-point) on CentOS 5 64-bit using WIEN2k_09.2
(Release 29/9/2009) + ifort 11.0.074 + Intel MKL 10.1.0.015 + MVAPICH2.
I had no problems. Here is my .machines file:

$ cat .machines
granularity:1
1:node26 node26 node26 node26 node26 node26 node26 node26
lapw0:node26:8

Oleg

--
Oleg Rubel, PhD
Thunder Bay Regional Research Institute
290 Munro St, Thunder Bay, P7A  7T1, Ontario, Canada
Homepage: http://www.tbrri.com/~orubel/

>>> Laurence Marks <L-marks at northwestern.edu> 12/24/09 1:16 AM >>>
I have not yet had a chance to install the very latest version of
Wien2k, but I can say, unconditionally, that with the version before
the current one there is NO problem running lapw2_mpi.

I would worry a bit about OpenMPI - have you tried mvapich?

On Wed, Dec 23, 2009 at 6:08 AM, Sergiu Arapan <sergiu.arapan at gmail.com>
wrote:
> Dear wien2k users and developers,
>
> I would like to post a few comments on running the parallel version of
> wien2k on a distributed-memory cluster. I'm using the most recent
> version of wien2k (09.2) on a Linux-based cluster with 805 HP ProLiant
> DL140 G3 nodes, each node consisting of an Intel Xeon E5345 quad-core
> processor at 2.33 GHz with 4 MB Level 2 cache, interconnected by a
> next-generation InfiniBand interconnect. The operating system is CentOS
> 5 64-bit Linux and the resource manager is SLURM. I compiled the source
> code with the Intel compilers (ifort 10.1.017) and Intel-built OpenMPI
> (mpif90 1.2.7), and linked against MKL (10.0.1.014), FFTW (2.1.5) and
> the corresponding OpenMPI libs.
>
> My first comment concerns the implementation of MPI for fine-grain
> parallelization. Within the current version of wien2k, the module
> lapw2_mpi crashes if N_noneq_atoms (the number of nonequivalent atoms
> in the case.struct file) is not a multiple of N_cpus (the number of
> processors running lapw2_mpi). This strange behavior was reported in a
> recent post by Duy Le with the subject "[Wien] MPI problem for LAPW2"
> (http://zeus.theochem.tuwien.ac.at/pipermail/wien/2009-September/012042.html).
> He noticed that for a system consisting of 21 (nonequivalent) atoms the
> program runs only on 3 or 7 cpus. He managed to cure the problem by
> setting lapw2_vector_split:$N_cpus, but without a reasonable
> explanation. However, one can get a hint by looking at the lapw2 source
> files and the output of lapw2_mpi. Let's consider, for example,
> cd16te15sb.struct from $WIENROOT/example_struct_files, which describes
> a structure with 5 nonequivalent atoms. Let's run it on a compute node
> with 8 cpus with the following .machines file:
> granularity:1
> 1:n246:8
> lapw0:n246:8
> extrafine:1
>
> Here are some lines from the resulting case.dayfile:
>> lapw0 -p (19:41:09) starting parallel lapw0 at Tue Dec 22 19:41:09 CET 2009
> -------- .machine0 : 8 processors
> mpirun --verbose -np 8 --hostfile .machine0 $WIENROOT/lapw0_mpi lapw0.def
> Tue Dec 22 19:41:27 CET 2009 -> all processes done.
>
> …..................................................................................
>> lapw1 -c -p (19:41:28) starting parallel lapw1 at Tue Dec 22 19:41:28 CET 2009
> -> starting parallel LAPW1 jobs at Tue Dec 22 19:41:28 CET 2009
> 1 number_of_parallel_jobs
> -------- .machine1 : 8 processors : weight 1
> mpirun --verbose -np 8 --hostfile .machine1 $WIENROOT/lapw1c_mpi lapw1_1.def
> waiting for all processes to complete
> Tue Dec 22 19:48:26 CET 2009 -> all processes done.
>
> ….....................................................................................
>> lapw2 -c -p (19:48:28) running LAPW2 in parallel mode
> running parallel lapw2
> mpirun --verbose -np 8 --hostfile .machine1 $WIENROOT/lapw2c_mpi lapw2_1.def
> 1
> sleeping for 1 seconds
> waiting for processes:
> ** LAPW2 crashed!
>
> ….....................................................................................
>
> The job crashed with the following error message:
> [n246:15992] *** An error occurred in MPI_Comm_split
> [n246:15992] *** on communicator MPI_COMM_WORLD
> [n246:15992] *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> Now, if one takes a look at case.output2_1_proc_n (n=1,2,...,7), one
> sees the following header (here is the case of case.output2_1_proc_1):
>
> init_parallel_2 1 8 1 8 2
> MPI run on 8 processors in MPI_COMM_WORLD
> 8 processors in MPI_vec_COMM (atoms splitting)
> 1 processors in MPI_atoms_COMM (vector splitting)
>
> myid= 1
> myid_atm= 1
> myid_vec= 1
>
> time in recpr: 0.820000000000000
>
> One can find the following lines in the lapw2.F source file (lines 129-137):
> #ifdef Parallel
> write(6,*) 'MPI run on ',npe,' processors in MPI_COMM_WORLD'
> write(6,*) ' ',npe_atm,' processors in MPI_vec_COMM (atoms splitting)'
> write(6,*) ' ',npe_vec,' processors in MPI_atoms_COMM (vector splitting)'
> write(6,*) ' myid= ',myid
> write(6,*) ' myid_atm= ',myid
> write(6,*) ' myid_vec= ',myid
> write(6,*) ' '
> #endif
> which generate this output.
>
> If I understand correctly, npe is the total number of cpus, npe_atm is
> the number of cpus for parallelization over atoms, and npe_vec is the
> number of cpus for the additional parallelization of the density over
> vectors (I think that the labels MPI_vec_COMM and MPI_atoms_COMM should
> be swapped).
>
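> As a side note, the relation between these three numbers can be
> pictured with a small self-contained MPI sketch. This is not the
> WIEN2k source; it only illustrates, under the assumption
> npe = npe_atm*npe_vec and with my own (hypothetical) rank bookkeeping,
> how such a two-level split can be built with MPI_Comm_split:
>
>   program comm_split_sketch
>     ! illustration only: split npe ranks into npe_atm atom groups,
>     ! each further split over npe_vec "vector" ranks
>     use mpi
>     implicit none
>     integer :: ierr, npe, myid, npe_vec, npe_atm
>     integer :: myid_atm, myid_vec, comm_atm, comm_vec
>     call MPI_Init(ierr)
>     call MPI_Comm_size(MPI_COMM_WORLD, npe, ierr)
>     call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
>     npe_vec = 1                    ! the value given by lapw2_vector_split
>     npe_atm = npe / npe_vec        ! integer division, as in modules.F
>     myid_vec = mod(myid, npe_vec)  ! position within one atom group
>     myid_atm = myid / npe_vec      ! which atom group this rank belongs to
>     ! ranks sharing myid_vec distribute the atoms among themselves;
>     ! ranks sharing myid_atm split the work for one group of atoms
>     call MPI_Comm_split(MPI_COMM_WORLD, myid_vec, myid_atm, comm_atm, ierr)
>     call MPI_Comm_split(MPI_COMM_WORLD, myid_atm, myid_vec, comm_vec, ierr)
>     call MPI_Finalize(ierr)
>   end program comm_split_sketch
>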
> One can also find the following lines (306-311) in the l2main.F file:
> ! ---------------------------------
> ! START LOOP FOR ALL ATOMS
> ! ---------------------------------
>
> non_equiv_loop: do jatom_pe=1,nat,npe_atm
> jatom=jatom_pe+myid_atm
>
> from which I understand that the loop runs over the nat nonequivalent
> atoms with a step of npe_atm.
>
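> To see why the step matters, here is a tiny stand-alone sketch (again
> not WIEN2k code; it only mimics the loop above for the present case,
> nat=5, with npe_atm=8) that prints which atom index each myid_atm
> would receive:
>
>   program atom_split_sketch
>     ! mimic non_equiv_loop: jatom = jatom_pe + myid_atm,
>     ! with jatom_pe = 1, 1+npe_atm, 1+2*npe_atm, ...
>     implicit none
>     integer, parameter :: nat = 5, npe_atm = 8
>     integer :: myid_atm, jatom_pe, jatom
>     do myid_atm = 0, npe_atm - 1
>       do jatom_pe = 1, nat, npe_atm
>         jatom = jatom_pe + myid_atm
>         if (jatom <= nat) then
>           print '(a,i2,a,i2)', ' myid_atm', myid_atm, ' -> atom', jatom
>         else
>           print '(a,i2,a)', ' myid_atm', myid_atm, ' -> atom index beyond nat'
>         end if
>       end do
>     end do
>   end program atom_split_sketch
>
> With nat=5 and npe_atm=8, the ranks with myid_atm=5,6,7 are assigned
> atom indices 6,7,8, which do not exist, while with npe_atm=5 every rank
> gets exactly one existing atom.
>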
> Now let's do some changes in the lapw2para to run lapw2_mpi on 5 cpus,
and
> take a look at case.dayfile and case.output2_1_proc_1 files.
> Here are lines from case.dayfile:
>> lapw0 -p (20:08:14) starting parallel lapw0 at Tue Dec 22 20:08:14 CET 2009
> Tue Dec 22 20:08:14 CET 2009 -> Setting up case Cd16Te15Sb for parallel execution
> -------- .machine0 : 8 processors
> mpirun --verbose -np 8 --hostfile .machine0 $WIENROOT/lapw0_mpi lapw0.def
> Tue Dec 22 20:08:33 CET 2009 -> all processes done.
>
> ….............................................................................
>> lapw1 -c -p (20:08:34) starting parallel lapw1 at Tue Dec 22 20:08:34 CET 2009
> -> starting parallel LAPW1 jobs at Tue Dec 22 20:08:34 CET 2009
> mpirun --verbose -np 8 --hostfile .machine1 $WIENROOT/lapw1c_mpi lapw1_1.def
> waiting for all processes to complete
> Tue Dec 22 20:15:32 CET 2009 -> all processes done.
>
> …...................................................................................................
>> lapw2 -c -p (20:15:33) running LAPW2 in parallel mode
> machines: n383
> running parallel lapw2
> mpirun --verbose -np 5 --hostfile .machine1 $WIENROOT/lapw2c_mpi lapw2_1.def
> 1
> sleeping for 1 seconds
> waiting for processes:
> n383 0.014u 0.008s 0:51.32 0.0% 0+0k 0+0io 0pf+0w
> …................................................
> :ENERGY convergence: 0 0 .0004326450000000
> :CHARGE convergence: 1 0.001 -.0000259
> ec cc and fc_conv 1 1 1
>> stop
>
> This time the job terminated successfully, and the first lines of
> case.output2_1_proc_1 read as follows:
> init_parallel_2 1 5 1 5 2
> MPI run on 5 processors in MPI_COMM_WORLD
> 5 processors in MPI_vec_COMM (atoms splitting)
> 1 processors in MPI_atoms_COMM (vector splitting)
>
> myid= 1
> myid_atm= 1
> myid_vec= 1
>
> time in recpr: 0.820000000000000
>
> :POS002: AT.NR. -2 POSITION = 0.12426 0.12426 0.12426 MULTIPLICITY = 4
>
> …....................................................................................................................................
>
> Now let's see what happens when lapw2_vector_split:N_cpus is set in
> the .machines file:
> granularity:1
> 1:n21:8
> lapw2_vector_split:8
> lapw0:n21:8
> extrafine:1
>
> One can read the following lines from case.dayfile:
>> lapw2 -c -p (21:06:51) running LAPW2 in parallel mode
> machines: n21
> running parallel lapw2
> mpirun --verbose -np 8 --hostfile .machine1 /home/x_serar
> n21 0.027u 0.012s 1:26.55 0.0% 0+0k 0+0io 0pf+0w
>
> ….......................................................................................................
> :ENERGY convergence: 0 0 .0000601700000000
> :CHARGE convergence: 1 0.001 -.0006113
> ec cc and fc_conv 1 1 1
>> stop
>
> The first lines of case.output2_1_proc_1 read as follows:
> init_parallel_2 1 8 8 1 1
> MPI run on 8 processors in MPI_COMM_WORLD
> 1 processors in MPI_vec_COMM (atoms splitting)
> 8 processors in MPI_atoms_COMM (vector splitting)
>
> myid= 1
> myid_atm= 1
> myid_vec= 1
>
> time in recpr: 0.810000000000000
> 0 0.191606E+00 -0.475342E+00 -0.217499E+00 -0.295484E+00 0.795881E-01 4 4 4
>
> ….......................................................................................................................................................
>
> That is, there is no atom splitting, and npe_atm=1 is a divisor of the
> number of nonequivalent atoms. This result, npe_atm=1, becomes clear if
> one takes a look at modules.F, SUBROUTINE init_parallel_2 (line 78):
>
> ...................................................................................
> npe_atm=npe/npe_vec
>
> …...............................................................................
>
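> To spell out the arithmetic (my own summary of the runs above): with 8
> processors and lapw2_vector_split:8 one gets npe_vec=8 and
> npe_atm=8/8=1, and a single atom group fits any number of nonequivalent
> atoms; with the default vector split, npe_atm=8/1=8 does not fit the 5
> nonequivalent atoms of cd16te15sb and lapw2_mpi dies in MPI_Comm_split,
> whereas 5 processors give npe_atm=5 and the run completes.
>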
> Thus, the crash of lapw2_mpi is not related to memory issues, but to
> the way the parallelization is implemented. My analysis is, of course,
> superficial, and I cannot say whether there is a bug in the lapw2_mpi
> module. But I think this issue deserves some attention from the
> developers.
>
> My second comment is that you do not need to connect through ssh to the
> allocated processors on different compute nodes in order to run
> lapw1(c) or lapw2(c) (the case of parallelization over k-points). You
> can launch these parallel processes by invoking mpirun instead.
> First, set
>   setenv WIEN_MPIRUN 'mpirun -np _NP_ --hostfile _HOSTS_ _EXEC_'
> in $WIENROOT/parallel_options.
> Second, in lapw1para (line 406), replace the line
>   (cd $PWD;$t $exe ${def}_$loop.def;rm -f .lock_$lockfile[$p]) >>.time1_$loop &
> with the following two lines:
>   set ttt=(`echo $mpirun | sed -e "s^_NP_^$number_per_job[$p]^" -e "s^_HOSTS_^.machine$p^" -e "s^_EXEC_^$WIENROOT/${exe} ${def}_$loop.def^"`)
>   (cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]) >>.time1_$loop &
> similar to the mpi execution.
> In the same fashion, in lapw2para (line 314), replace the line
>   (cd $PWD;$t $exe ${def}_${loop}.def $loop;rm -f .lock_$lockfile[$p]) >>.time2_$loop &
> with the following two lines:
>   set ttt=(`echo $mpirun | sed -e "s^_NP_^$number_per_job2[$loop]^" -e "s^_HOSTS_^.machine$mach[$loop]^" -e "s^_EXEC_^$WIENROOT/${exe} ${def}_$loop.def $loop^"`)
>   (cd $PWD;$t $ttt $vector_split;rm -f .lock_$lockfile[$p]) >>.time2_$loop &
>
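> For illustration (my own example, not actual output): with the
> WIEN_MPIRUN template above, one processor per k-point job and the host
> list in .machine1, the sed substitution in lapw1para builds a command
> of the form
>   mpirun -np 1 --hostfile .machine1 $WIENROOT/lapw1c lapw1_1.def
> which starts the serial lapw1c on the allocated node without any ssh
> connection.
>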
> I hope you will find these comments useful :) .
>
> Regards and Merry Christmas,
> Sergiu Arapan
>



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.
_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien


