Thank you. Those are interesting findings, especially the first one. I haven&#39;t dug that deep into the code after finding the alternative way.<br>Merry Xmas.<br clear="all">--------------------------------------------------<br>


Duy Le<br>PhD Student<br>Department of Physics<br>University of Central Florida.<br><br>&quot;Men don&#39;t need hand to do things&quot;<br>

<br><br><div class="gmail_quote">On Wed, Dec 23, 2009 at 7:08 AM, Sergiu Arapan <span dir="ltr">&lt;<a href="mailto:sergiu.arapan@gmail.com">sergiu.arapan@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Dear wien2k users and developers,<br>

<br>

I would like to post a few comments on running the parallel version of wien2k on a distributed-memory cluster. I&#39;m using the most recent version of wien2k (09.2) on a Linux-based cluster with 805 HP ProLiant DL140 G3 nodes, each node consisting of Intel Xeon E5345 Quad Core Processor 2.33 GHz, 4 MB Level 2 cache, interconnected by Next generation Infiniband interconnect. The operating system is CentOS 5 64-bit Linux and the resource manager is SLURM. I compiled the source code with Intel compilers (ifort 10.1.017) and Intel-built OpenMPI (mpif90 1.2.7), and linked it with MKL (10.0.1.014), FFTW (2.1.5) and the corresponding OpenMPI libs.<br>


<br>

My first comment concerns the MPI implementation of the fine-grained parallelization. In the current version of wien2k, the module lapw2_mpi crashes if N_noneq_atoms (the number of nonequivalent atoms in the case.struct file) is not a multiple of N_cpus (the number of processors used to run lapw2_mpi). This strange behavior was reported in a recent post by Duy Le with the subject “[Wien] MPI problem for LAPW2” (<a href="http://zeus.theochem.tuwien.ac.at/pipermail/wien/2009-September/012042.html" target="_blank">http://zeus.theochem.tuwien.ac.at/pipermail/wien/2009-September/012042.html</a>). He noticed that for a system consisting of 21 (nonequivalent) atoms the program runs only on 3 or 7 cpus. He managed to cure the problem by setting lapw2_vector_split:$N_cpus, but without a reasonable explanation. However, one can get a hint by looking at the lapw2 source files and the output of lapw2_mpi. Let&#39;s consider, for example, cd16te15sb.struct from $WIENROOT/example_struct_files, which describes a structure with 5 nonequivalent atoms. Let&#39;s run it on a compute node with 8 cpus with the following .machines file:<br>


granularity:1<br>

1:n246:8<br>

lapw0:n246:8<br>

extrafine:1<br>

<br>

Here are some lines from the resulting case.dayfile:<br>

&gt; lapw0 -p (19:41:09) starting parallel lapw0 at Tue Dec 22 19:41:09 CET 2009<br>

-------- .machine0 : 8 processors<br>

mpirun --verbose -np 8 --hostfile .machine0 $WIENROOT/lapw0_mpi lapw0.def<br>

Tue Dec 22 19:41:27 CET 2009 -&gt; all processes done.<br>

….................................................................................. <br>

&gt; lapw1 -c -p (19:41:28) starting parallel lapw1 at Tue Dec 22 19:41:28 CET 2009<br>

-&gt; starting parallel LAPW1 jobs at Tue Dec 22 19:41:28 CET 2009<br>

1 number_of_parallel_jobs<br>

-------- .machine1 : 8 processors : weight 1<br>

mpirun --verbose -np 8 --hostfile .machine1 $WIENROOT/lapw1c_mpi lapw1_1.def<br>

waiting for all processes to complete<br>

Tue Dec 22 19:48:26 CET 2009 -&gt; all processes done.<br>

…..................................................................................... <br>

&gt; lapw2 -c -p (19:48:28) running LAPW2 in parallel mode<br>

running parallel lapw2<br>

mpirun --verbose -np 8 --hostfile .machine1 $WIENROOT/lapw2c_mpi lapw2_1.def 1<br>

sleeping for 1 seconds<br>

waiting for processes:<br>

** LAPW2 crashed!<br>

…..................................................................................... <br>

<br>

The job crashed with the following error message:<br>

[n246:15992] *** An error occurred in MPI_Comm_split<br>

[n246:15992] *** on communicator MPI_COMM_WORLD<br>

[n246:15992] *** MPI_ERR_ARG: invalid argument of some other kind<br>

[n246:15992] *** MPI_ERRORS_ARE_FATAL (goodbye)<br>

<br>

Now, if one takes a look at case.output2_1_proc_n (n=1,2,..,7), one sees the following header (here for case.output2_1_proc_1):<br>

<br>

init_parallel_2 1 8 1 8 2<br>

MPI run on 8 processors in MPI_COMM_WORLD<br>

8 processors in MPI_vec_COMM (atoms splitting)<br>

1 processors in MPI_atoms_COMM (vector splitting)<br>

<br>

myid= 1<br>

myid_atm= 1<br>

myid_vec= 1<br>

<br>

time in recpr: 0.820000000000000<br>

<br>

One can find the following lines in the lapw2.F source file (lines 129-137):<br>

#ifdef Parallel<br>

write(6,*) &#39;MPI run on &#39;,npe,&#39; processors in MPI_COMM_WORLD&#39;<br>

write(6,*) &#39; &#39;,npe_atm,&#39; processors in MPI_vec_COMM (atoms splitting)&#39;<br>

write(6,*) &#39; &#39;,npe_vec,&#39; processors in MPI_atoms_COMM (vector splitting)&#39;<br>

write(6,*) &#39; myid= &#39;,myid<br>

write(6,*) &#39; myid_atm= &#39;,myid<br>

write(6,*) &#39; myid_vec= &#39;,myid<br>

write(6,*) &#39; &#39;<br>

#endif<br>

which generate this output.<br>

<br>

If I understand correctly, npe is the total number of cpus, npe_atm is the number of cpus for parallelization over atoms, and npe_vec is the number of cpus for additional parallelization of the density over vectors (I think that the labels MPI_vec_COMM and MPI_atoms_COMM in these write statements should be swapped).<br>


<br>

One can also find the following lines (306-311) in the l2main.F file:<br>

! ---------------------------------<br>

! START LOOP FOR ALL ATOMS<br>

! ---------------------------------<br>

<br>

non_equiv_loop: do jatom_pe=1,nat,npe_atm<br>

jatom=jatom_pe+myid_atm<br>

<br>

from which I understand that the loop runs over the nat nonequivalent atoms with step npe_atm.<br>
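<br>
To make the index arithmetic concrete, here is a minimal Python sketch (not WIEN2k code) that mimics the loop header quoted above. It assumes that myid_atm is a zero-based rank running from 0 to npe_atm-1, which the excerpt does not state; the function names atoms_per_rank and out_of_range are made up for illustration.<br>

```python
def atoms_per_rank(nat, npe_atm):
    """Mimic: do jatom_pe=1,nat,npe_atm ; jatom=jatom_pe+myid_atm.

    Assumption: myid_atm is a zero-based rank, 0 .. npe_atm-1.
    Returns {rank: [jatom, ...]}, including indices beyond nat.
    """
    work = {rank: [] for rank in range(npe_atm)}
    # Fortran do-loop from 1 to nat (inclusive) with stride npe_atm
    for jatom_pe in range(1, nat + 1, npe_atm):
        for myid_atm in range(npe_atm):
            work[myid_atm].append(jatom_pe + myid_atm)
    return work


def out_of_range(nat, npe_atm):
    """Atom indices larger than nat that this arithmetic would produce."""
    return sorted(j for atoms in atoms_per_rank(nat, npe_atm).values()
                  for j in atoms if j > nat)


if __name__ == "__main__":
    for npe_atm in (5, 8):
        print(npe_atm, atoms_per_rank(5, npe_atm), out_of_range(5, npe_atm))
```

With nat=5, five ranks each get exactly one atom, while eight ranks would produce the indices 6, 7 and 8, which do not exist. Whether lapw2 guards against this elsewhere is not visible in the excerpt, so the sketch only illustrates why a divisor of nat is the safe choice for npe_atm.<br>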

<br>

Now let&#39;s make some changes in lapw2para to run lapw2_mpi on 5 cpus, and take a look at the case.dayfile and case.output2_1_proc_1 files.<br>

Here are lines from case.dayfile:<br>

&gt; lapw0 -p (20:08:14) starting parallel lapw0 at Tue Dec 22 20:08:14 CET 2009<br>

Tue Dec 22 20:08:14 CET 2009 -&gt; Setting up case Cd16Te15Sb for parallel execution<br>

-------- .machine0 : 8 processors<br>

mpirun --verbose -np 8 --hostfile .machine0 $WIENROOT/lapw0_mpi lapw0.def<br>

Tue Dec 22 20:08:33 CET 2009 -&gt; all processes done.<br>

…............................................................................. <br>

&gt; lapw1 -c -p (20:08:34) starting parallel lapw1 at Tue Dec 22 20:08:34 CET 2009<br>

-&gt; starting parallel LAPW1 jobs at Tue Dec 22 20:08:34 CET 2009<br>

mpirun --verbose -np 8 --hostfile .machine1 $WIENROOT/lapw1c_mpi lapw1_1.def<br>

waiting for all processes to complete<br>

Tue Dec 22 20:15:32 CET 2009 -&gt; all processes done.<br>

…................................................................................................... <br>

&gt; lapw2 -c -p (20:15:33) running LAPW2 in parallel mode<br>

machines: n383<br>

running parallel lapw2<br>

mpirun --verbose -np 5 --hostfile .machine1 $WIENROOT/lapw2c_mpi lapw2_1.def 1<br>

sleeping for 1 seconds<br>

waiting for processes:<br>

n383 0.014u 0.008s 0:51.32 0.0% 0+0k 0+0io 0pf+0w<br>

…................................................<br>

:ENERGY convergence: 0 0 .0004326450000000<br>

:CHARGE convergence: 1 0.001 -.0000259<br>

ec cc and fc_conv 1 1 1<br>

&gt; stop<br>

<br>

This time the job terminated successfully, and the first lines of case.output2_1_proc_1 read:<br>

init_parallel_2 1 5 1 5 2<br>

MPI run on 5 processors in MPI_COMM_WORLD<br>

5 processors in MPI_vec_COMM (atoms splitting)<br>

1 processors in MPI_atoms_COMM (vector splitting)<br>

<br>

myid= 1<br>

myid_atm= 1<br>

myid_vec= 1<br>

<br>

time in recpr: 0.820000000000000<br>

<br>

:POS002: AT.NR. -2 POSITION = 0.12426 0.12426 0.12426 MULTIPLICITY = 4<br>

….................................................................................................................................... <br>

<br>

Now, let&#39;s see what happens when lapw2_vector_split:N_cpus is set in the .machines file:<br>

granularity:1<br>

1:n21:8<br>

lapw2_vector_split:8<br>

lapw0:n21:8<br>

extrafine:1<br>

<br>

One can read the following lines from case.dayfile:<br>

&gt; lapw2 -c -p (21:06:51) running LAPW2 in parallel mode<br>

machines: n21<br>

running parallel lapw2<br>

mpirun --verbose -np 8 --hostfile .machine1 /home/x_serar/wien2k/09.2/openmpi/lapw2c_mpi lapw2_1.def 1<br>

sleeping for 1 seconds<br>

waiting for processes:<br>

n21 0.027u 0.012s 1:26.55 0.0% 0+0k 0+0io 0pf+0w<br>

…....................................................................................................... <br>

:ENERGY convergence: 0 0 .0000601700000000<br>

:CHARGE convergence: 1 0.001 -.0006113<br>

ec cc and fc_conv 1 1 1<br>

&gt; stop<br>

<br>

The first lines of case.output2_1_proc_1 read:<br>

init_parallel_2 1 8 8 1 1<br>

MPI run on 8 processors in MPI_COMM_WORLD<br>

1 processors in MPI_vec_COMM (atoms splitting)<br>

8 processors in MPI_atoms_COMM (vector splitting)<br>

<br>

myid= 1<br>

myid_atm= 1<br>

myid_vec= 1<br>

<br>

time in recpr: 0.810000000000000<br>

0 0.191606E+00 -0.475342E+00 -0.217499E+00 -0.295484E+00 0.795881E-01 4 4 4<br>

…....................................................................................................................................................... <br>

<br>

That is, there is no atom splitting, and npe_atm=1 is a divisor of the number of nonequivalent atoms. The result npe_atm=1 becomes clear if one takes a look at modules.F, SUBROUTINE init_parallel_2 (line 78):<br>

................................................................................... <br>

npe_atm=npe/npe_vec<br>

…............................................................................... <br>

<br>

Thus, the crash of lapw2_mpi is not related to memory issues, but to the way the parallelization is implemented. My analysis is, of course, superficial, and I cannot say whether there is a bug in the lapw2_mpi module. But I think that this issue requires some attention from the developers.<br>
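<br>
The three runs above can be summarised with a small Python sketch (an illustration, not part of WIEN2k). It only reproduces the relation npe_atm=npe/npe_vec from init_parallel_2 and checks whether npe_atm divides the number of nonequivalent atoms. The helper name split_ok is hypothetical, integer division is assumed, and the divisibility test reflects the behaviour observed in these runs rather than a documented requirement.<br>

```python
def split_ok(nat, npe, npe_vec=1):
    """Return (npe_atm, divides) for a given MPI layout.

    npe_atm = npe / npe_vec as in init_parallel_2 (integer division
    is an assumption). 'divides' is True when npe_atm is a divisor
    of nat, the number of nonequivalent atoms.
    """
    npe_atm = npe // npe_vec
    return npe_atm, nat % npe_atm == 0


if __name__ == "__main__":
    nat = 5  # nonequivalent atoms in cd16te15sb.struct
    # (npe, npe_vec) for the three runs described in the text
    for npe, npe_vec in ((8, 1), (5, 1), (8, 8)):
        print(npe, npe_vec, split_ok(nat, npe, npe_vec))
```

For nat=5 this flags the first layout (8 cpus, no vector split) as the only one where npe_atm is not a divisor of nat, which is also the only run that crashed.<br>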


<br>

My second comment is that you do not need to connect through ssh to allocated processors on different computer nodes in order to run lapw1(c) or lapw2(c) (the case of parallelization over k-points). You can run your parallel processes by invoking mpirun.<br>


First, set &quot;setenv WIEN_MPIRUN &#39;mpirun -np _NP_ --hostfile _HOSTS_ _EXEC_&#39;&quot; in $WIENROOT/parallel_options.<br>

Second, instead of the line “(cd $PWD;$t $exe ${def}_$loop.def;rm -f .lock_$lockfile[$p]) &gt;&gt;.time1_$loop &amp;” in lapw1para (line 406), use the following two lines:<br>

“set ttt=(`echo $mpirun | sed -e &quot;s^_NP_^$number_per_job[$p]^&quot; -e &quot;s^_HOSTS_^.machine$p^&quot; -e &quot;s^_EXEC_^$WIENROOT/${exe} ${def}_$loop.def^&quot;`)”<br>

and<br>

“(cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]) &gt;&gt;.time1_$loop &amp;”<br>

<br>

similar to mpi execution.<br>

In the same fashion, in lapw2para replace line 314, “(cd $PWD;$t $exe ${def}_${loop}.def $loop;rm -f .lock_$lockfile[$p]) &gt;&gt;.time2_$loop &amp;”, with the following two lines:<br>

“set ttt=(`echo $mpirun | sed -e &quot;s^_NP_^$number_per_job2[$loop]^&quot; -e &quot;s^_HOSTS_^.machine$mach[$loop]^&quot; -e &quot;s^_EXEC_^$WIENROOT/${exe} ${def}_$loop.def $loop^&quot;`)”<br>

and<br>

“(cd $PWD;$t $ttt $vector_split;rm -f .lock_$lockfile[$p]) &gt;&gt;.time2_$loop &amp;”.<br>
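<br>
For readers less familiar with sed, the placeholder substitution performed by the &quot;set ttt=...&quot; lines can be sketched in Python as follows. This is only an illustration of the text replacement; the function name build_command is hypothetical, and the sample values (1 process, .machine1, lapw1c with lapw1_1.def) are chosen as an example and are not taken from an actual run.<br>

```python
def build_command(template, nprocs, hostfile, executable):
    """Fill the _NP_, _HOSTS_ and _EXEC_ placeholders of WIEN_MPIRUN."""
    return (template
            .replace("_NP_", str(nprocs))
            .replace("_HOSTS_", hostfile)
            .replace("_EXEC_", executable))


if __name__ == "__main__":
    # Template as set in $WIENROOT/parallel_options
    wien_mpirun = "mpirun -np _NP_ --hostfile _HOSTS_ _EXEC_"
    # Illustrative values only
    print(build_command(wien_mpirun, 1, ".machine1",
                        "$WIENROOT/lapw1c lapw1_1.def"))
```

The resulting string is what the modified script then executes in place of the direct call to the executable.<br>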

<br>

I hope you will find these comments useful :) .<br>

<br>

Regards and Merry Christmas,<br>

Sergiu Arapan<br>

<br>

<br>

<br>


</blockquote></div><br>