[Wien] Parallel calculation in more than 2 nodes

Peter Blaha pblaha at theochem.tuwien.ac.at
Tue Jul 21 15:28:08 CEST 2020


 > parallel_options file: # setenv WIEN_MPIRUN "srun -K1 _EXEC_"
 > Because of compatibility issues, we don't use srun; we commented out the
 > WIEN_MPIRUN line in the parallel_options file and use mpirun directly.

You cannot simply comment out the WIEN_MPIRUN variable.
If you don't want to use srun, set it to:

setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"

(or choose the ifort option without srun during siteconfig)
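
For reference, here is a minimal sketch of how the relevant lines in
$WIENROOT/parallel_options could look with Intel MPI; the values other
than WIEN_MPIRUN are assumptions and depend on your siteconfig answers:

setenv TASKSET "no"
setenv USE_REMOTE 1
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"

With your .machines file (1:lame26:16 and 1:lame28:16) the parallel
scripts substitute _NP_, _HOSTS_ and _EXEC_ and launch two independent
jobs, roughly

mpirun -np 16 -machinefile .machine1 lapw1c_mpi lapw1_1.def
mpirun -np 16 -machinefile .machine2 lapw1c_mpi lapw1_2.def

so that one 16-process job runs on lame26 and the other on lame28.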

Regards

On 7/21/20 12:48 PM, MA Weiliang wrote:
> Dear WIEN2K users,
> 
> The cluster we use is a shared-memory system with 16 CPUs per node. The calculation was distributed over 2 nodes with 32 CPUs. But according to the attached top output, all the MPI processes were actually running on the first node; there were no processes on the second node. As you can see, the CPU usage is around 50%. It seems the calculation was not distributed over 2 nodes, but only split the first node (16 CPUs) into 32 processes, each with half the computing power.
> 
> Do you have any ideas for this problem? The .machines, wien2k info, dayfile and job output are attached below. Thank you!
> 
> Best,
> Weiliang
> 
> 
> #========================================#
> #  output of top
> #----------------------------------------#
>    PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 43504 mc        20   0  614m 262m  27m R 50.2  0.3  21:45.54 lapw1c_mpi
> 43507 mc        20   0  611m 259m  26m R 50.2  0.3  21:50.76 lapw1c_mpi
> 43514 mc        20   0  614m 255m  22m R 50.2  0.3  21:51.37 lapw1c_mpi
> ...
> 32 lines in total
> ...
> 43508 mc        20   0  615m 260m  23m R 49.5  0.3  21:43.73 lapw1c_mpi
> 43513 mc        20   0  616m 257m  22m R 49.5  0.3  21:51.32 lapw1c_mpi
> 43565 mc        20   0  562m 265m  24m R 49.5  0.3  21:43.29 lapw1c_mpi
> 
> 
> #========================================#
> # .machines file
> #----------------------------------------#
> 1:lame26:16
> 1:lame28:16
> lapw0: lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28
> dstart: lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28
> nlvdw: lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28
> lapw2_vector_split:2
> granularity:1
> extrafine:1
> 
> 
> #========================================#
> # wien2k info
> #----------------------------------------#
> wien2k version: 18.2
> compiler: ifort, icc, mpiifort (Intel 2017 compilers)
> parallel_options file: # setenv WIEN_MPIRUN "srun -K1 _EXEC_"
> Because of compatibility issues, we don't use srun; we commented out the WIEN_MPIRUN line in the parallel_options file and use mpirun directly.
> 
> 
> #========================================#
> # dayfile
> #----------------------------------------#
>      cycle 7     (Mon Jul 20 20:56:01 CEST 2020)         (194/93 to go)
> 
>>    lapw0  -p   (20:56:01) starting parallel lapw0 at Mon Jul 20 20:56:01 CEST 2020
> -------- .machine0 : 32 processors
> 0.087u 0.176s 0:17.87 1.3%      0+0k 0+112io 0pf+0w
>>    lapw1  -p   -c      (20:56:19) starting parallel lapw1 at Mon Jul 20 20:56:19 CEST 2020
> ->  starting parallel LAPW1 jobs at Mon Jul 20 20:56:20 CEST 2020
> running LAPW1 in parallel mode (using .machines)
> 2 number_of_parallel_jobs
>       lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26(16) 0.022u 0.049s 56:37.88 0.0%
>      0+0k 0+8io 0pf+0w
>       lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28(16) 0.031u 0.038s 56:00.24 0.0%
>      0+0k 0+8io 0pf+0w
>     Summary of lapw1para:
>     lame26        k=0     user=0  wallclock=0
>     lame28        k=0     user=0  wallclock=0
> 18.849u 18.501s 56:40.85 1.0%   0+0k 0+1032io 0pf+0w
>>    lapwso -p -c        (21:53:00) running LAPWSO in parallel mode
>        lame26 0.026u 0.044s 2:20:06.55 0.0% 0+0k 0+8io 0pf+0w
>        lame28 0.027u 0.043s 2:18:40.89 0.0% 0+0k 0+8io 0pf+0w
>     Summary of lapwsopara:
>     lame26        user=0.026      wallclock=140
>     lame28        user=0.027      wallclock=138
> 0.235u 2.621s 2:20:13.57 0.0%   0+0k 0+864io 0pf+0w
>>    lapw2 -p    -c -so  (00:13:14) running LAPW2 in parallel mode
>        lame26 0.023u 0.044s 4:58.20 0.0% 0+0k 0+8io 0pf+0w
>        lame28 0.024u 0.044s 5:02.58 0.0% 0+0k 0+8io 0pf+0w
>     Summary of lapw2para:
>     lame26        user=0.023      wallclock=298.2
>     lame28        user=0.024      wallclock=302.58
> 5.836u 1.057s 5:11.94 2.2%      0+0k 0+166184io 0pf+0w
>>    lcore       (00:18:26) 1.576u 0.042s 0:02.06 78.1%  0+0k 0+12888io 0pf+0w
>>    mixer       (00:18:30) 6.472u 0.687s 0:07.97 89.7%  0+0k 0+308832io 0pf+0w
> :ENERGY convergence:  0 0.000005 .0001215250000000
> :CHARGE convergence:  0 0.00005 .0002538
> ec cc and fc_conv 0 0 1
> 
> 
> #========================================#
> # job output
> #----------------------------------------#
> in cycle 3    ETEST: .5230513600000000   CTEST: .0049036
>   LAPW0 END
> [1]    Done                          mpirun -np 32 /home/mcs/work/wma/Package/wien2k.18m/lapw0_mpi lapw0.def >> .time00
>   LAPW1 END
> [1]  - Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>   LAPW1 END
> [2]    Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> LAPWSO END
> LAPWSO END
> [2]    Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .timeso_$loop
> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .timeso_$loop
> LAPW2 - FERMI; weights written
>   LAPW2 END
>   LAPW2 END
> [2]    Done                          ( cd $PWD; $t $ttt $vector_split; rm -f .lock_$lockfile[$p] ) >> .time2_$loop
> [1]  + Done                          ( cd $PWD; $t $ttt $vector_split; rm -f .lock_$lockfile[$p] ) >> .time2_$loop
>   SUMPARA END
>   CORE  END
>   MIXER END
> ec cc and fc_conv 0 0 1
> 

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------

