[Wien] Parallel calculation in more than 2 nodes
Peter Blaha
pblaha at theochem.tuwien.ac.at
Tue Jul 21 15:28:08 CEST 2020
> parallel_options file: # setenv WIEN_MPIRUN "srun -K1 _EXEC_"
> Because of compatibility issues we don't use srun; we commented out the
> WIEN_MPIRUN line in the parallel_options file and use mpirun directly.
You cannot simply comment out the WIEN_MPIRUN variable;
if you don't want to use srun, you should set it to:
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
(or, during siteconfig, select the ifort option without srun).
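For reference, a minimal sketch of the relevant part of $WIENROOT/parallel_options
for an Intel MPI setup without srun could look like this (the values other than
WIEN_MPIRUN are typical defaults and should be adapted to your cluster):

# $WIENROOT/parallel_options (sketch, assuming Intel MPI without srun)
setenv TASKSET "no"
setenv USE_REMOTE 1
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
# _NP_, _HOSTS_ and _EXEC_ are replaced by the parallel scripts with the
# process count, the generated .machine* host file and the mpi executable.
# Without "-machinefile _HOSTS_" mpirun gets no host list and will typically
# start all ranks on the local node, which matches the behaviour reported below.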
Regards
On 7/21/20 12:48 PM, MA Weiliang wrote:
> Dear WIEN2K users,
>
> The cluster we use is a shared-memory system with 16 CPUs per node. The calculation was distributed over 2 nodes with 32 CPUs. But according to the attached top output, all MPI processes were actually running on the first node; there were no processes on the second node. As you can see, the CPU usage is around 50%. It seems the calculation was not distributed over 2 nodes, but only split the first node (16 CPUs) into 32 processes, each with half the computing power.
>
> Do you have any ideas about this problem? The .machines file, WIEN2k info, dayfile, and job output are attached below. Thank you!
>
> Best,
> Weiliang
>
>
> #========================================#
> # output of top
> #----------------------------------------#
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 43504 mc 20 0 614m 262m 27m R 50.2 0.3 21:45.54 lapw1c_mpi
> 43507 mc 20 0 611m 259m 26m R 50.2 0.3 21:50.76 lapw1c_mpi
> 43514 mc 20 0 614m 255m 22m R 50.2 0.3 21:51.37 lapw1c_mpi
> ...
> 32 lines in total
> ...
> 43508 mc 20 0 615m 260m 23m R 49.5 0.3 21:43.73 lapw1c_mpi
> 43513 mc 20 0 616m 257m 22m R 49.5 0.3 21:51.32 lapw1c_mpi
> 43565 mc 20 0 562m 265m 24m R 49.5 0.3 21:43.29 lapw1c_mpi
>
>
> #========================================#
> # .machines file
> #----------------------------------------#
> 1:lame26:16
> 1:lame28:16
> lapw0: lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28
> dstart: lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28
> nlvdw: lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28
> lapw2_vector_split:2
> granularity:1
> extrafine:1
>
>
> #========================================#
> # wien2k info
> #----------------------------------------#
> WIEN2k version: 18.2
> compiler: ifort, icc, mpiifort (Intel 2017 compilers)
> parallel_options file: # setenv WIEN_MPIRUN "srun -K1 _EXEC_"
> Because of compatibility issues we don't use srun; we commented out the WIEN_MPIRUN line in the parallel_options file and use mpirun directly.
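(A quick check, independent of WIEN2k: the dayfile below shows that .machine0
lists 32 processors, so the same mpirun call can launch hostname to see where
the ranks actually land. The flags are an assumption about the Intel MPI
installation on this cluster.)

mpirun -np 32 -machinefile .machine0 hostname | sort | uniq -c
# a working machinefile should report 16 lines from lame26 and 16 from lame28;
# if all 32 ranks report the first node, the host list is being ignored.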
>
>
> #========================================#
> # dayfile
> #----------------------------------------#
> cycle 7 (Mon Jul 20 20:56:01 CEST 2020) (194/93 to go)
>
>> lapw0 -p (20:56:01) starting parallel lapw0 at Mon Jul 20 20:56:01 CEST 2020
> -------- .machine0 : 32 processors
> 0.087u 0.176s 0:17.87 1.3% 0+0k 0+112io 0pf+0w
>> lapw1 -p -c (20:56:19) starting parallel lapw1 at Mon Jul 20 20:56:19 CEST 2020
> -> starting parallel LAPW1 jobs at Mon Jul 20 20:56:20 CEST 2020
> running LAPW1 in parallel mode (using .machines)
> 2 number_of_parallel_jobs
> lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26 lame26(16) 0.022u 0.049s 56:37.88 0.0%
> 0+0k 0+8io 0pf+0w
> lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28 lame28(16) 0.031u 0.038s 56:00.24 0.0%
> 0+0k 0+8io 0pf+0w
> Summary of lapw1para:
> lame26 k=0 user=0 wallclock=0
> lame28 k=0 user=0 wallclock=0
> 18.849u 18.501s 56:40.85 1.0% 0+0k 0+1032io 0pf+0w
>> lapwso -p -c (21:53:00) running LAPWSO in parallel mode
> lame26 0.026u 0.044s 2:20:06.55 0.0% 0+0k 0+8io 0pf+0w
> lame28 0.027u 0.043s 2:18:40.89 0.0% 0+0k 0+8io 0pf+0w
> Summary of lapwsopara:
> lame26 user=0.026 wallclock=140
> lame28 user=0.027 wallclock=138
> 0.235u 2.621s 2:20:13.57 0.0% 0+0k 0+864io 0pf+0w
>> lapw2 -p -c -so (00:13:14) running LAPW2 in parallel mode
> lame26 0.023u 0.044s 4:58.20 0.0% 0+0k 0+8io 0pf+0w
> lame28 0.024u 0.044s 5:02.58 0.0% 0+0k 0+8io 0pf+0w
> Summary of lapw2para:
> lame26 user=0.023 wallclock=298.2
> lame28 user=0.024 wallclock=302.58
> 5.836u 1.057s 5:11.94 2.2% 0+0k 0+166184io 0pf+0w
>> lcore (00:18:26) 1.576u 0.042s 0:02.06 78.1% 0+0k 0+12888io 0pf+0w
>> mixer (00:18:30) 6.472u 0.687s 0:07.97 89.7% 0+0k 0+308832io 0pf+0w
> :ENERGY convergence: 0 0.000005 .0001215250000000
> :CHARGE convergence: 0 0.00005 .0002538
> ec cc and fc_conv 0 0 1
>
>
> #========================================#
> # job output
> #----------------------------------------#
> in cycle 3 ETEST: .5230513600000000 CTEST: .0049036
> LAPW0 END
> [1] Done mpirun -np 32 /home/mcs/work/wma/Package/wien2k.18m/lapw0_mpi lapw0.def >> .time00
> LAPW1 END
> [1] - Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> LAPW1 END
> [2] Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> LAPWSO END
> LAPWSO END
> [2] Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .timeso_$loop
> [1] + Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .timeso_$loop
> LAPW2 - FERMI; weights written
> LAPW2 END
> LAPW2 END
> [2] Done ( cd $PWD; $t $ttt $vector_split; rm -f .lock_$lockfile[$p] ) >> .time2_$loop
> [1] + Done ( cd $PWD; $t $ttt $vector_split; rm -f .lock_$lockfile[$p] ) >> .time2_$loop
> SUMPARA END
> CORE END
> MIXER END
> ec cc and fc_conv 0 0 1
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------