[Wien] Out-of-memory problems in parallel jobs

Peter Blaha pblaha at theochem.tuwien.ac.at
Tue May 12 08:43:59 CEST 2020


I have also seen this memory accumulation in lapw1 before when more than 
one k-point is used (but in large-scale computations with 100 or more 
atoms).
Most likely this is connected to the well-documented memory leak of 
Intel MPI (discussed on this mailing list), which was present until very 
recent versions. "Version 2019" alone is not enough information; the exact 
update matters.
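
To see what you actually have, check the exact builds, e.g. (assuming the 
Intel compilers and MPI are in your PATH):

   mpirun --version    # prints the full Intel MPI build string, incl. the update number
   ifort --version     # prints the exact ifort release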

1) Upgrade to the newest Intel MPI + ifort.

2) Why are you using mpi-parallelism for a small 10-atom cell and 5000 
k-points?

3) Upgrade to WIEN2k_19.2 and use OMP-parallelism (4 cores) + 
k-parallelism (then you can stay with your old compilers); see the 
sketch below.
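
For illustration, a .machines file for pure k-point parallelism with two 
k-parallel jobs and 4 OpenMP threads each could look roughly like this (a 
sketch only; 'lame25' is the node name taken from your output, and the 
omp_global switch needs WIEN2k 19.x; adjust the numbers to your node):

   # one k-parallel (non-MPI) lapw1/lapw2 job per line
   1:lame25
   1:lame25
   granularity:1
   extrafine:1
   # WIEN2k 19.x: run all programs with 4 OpenMP threads
   omp_global:4

The 5000 k-points are then simply split over the k-parallel jobs, and since 
no lapw1_mpi processes are started, the Intel MPI leak cannot accumulate.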

On 5/11/20 5:55 PM, MA Weiliang wrote:
> Dear Wien users,
> 
> The WIEN2k 18.2 I use was compiled on a shared-memory cluster with the Intel 
> compiler 2019, MKL 2019 and Intel MPI 2019. Because 'srun' does not produce a 
> correct parallel calculation on this system, I commented out the line 
> #setenv WIEN_MPIRUN "srun -K -N_nodes_ -n_NP_ -r_offset_ _PINNING_ _EXEC_" 
> in the parallel_options file and used the second choice 
> mpirun='mpirun -np _NP_ _EXEC_'.
> 
> Parallel jobs run fine during the SCF cycles. But when I increase the number 
> of k-points (to about 5000) to calculate the DOS, lapw1 crashes halfway with 
> the cgroup out-of-memory handler. That is very strange: with the same 
> parameters the job runs fine on a single core.
> 
> A similar problem occurs in the nlvdw_mpi step. I also increased the memory 
> limit to 50 GB for this cell of fewer than 10 atoms, but it still did not work.
> 
> [Parallel job output:]
> starting parallel lapw1 at lun. mai 11 16:24:48 CEST 2020
> ->  starting parallel LAPW1 jobs at lun. mai 11 16:24:48 CEST 2020
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 12604
> [1]  + Done          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>      lame25 lame25 lame25 lame25 lame25 lame25 lame25 lame25(5038) 4641.609u 123.862s 10:00.69 793.3%   0+0k 489064+2505080io 7642pf+0w
>    Summary of lapw1para:
>    lame25        k=0     user=0  wallclock=0
> **  LAPW1 crashed!
> 4643.674u 126.539s 10:03.50 790.4%      0+0k 490512+2507712io 7658pf+0w
> error: command   /home/mcsete/work/wma/Package/wien2k.18n/lapw1para lapw1.def   failed
> slurmstepd: error: Detected 1 oom-kill event(s) in step 86112.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
> 
> [Single mode output:]
>  LAPW1 END
> 11651.205u 178.664s 3:23:49.07 96.7%    0+0k 19808+22433688io 26pf+0w
> 
> Do you have any ideas? Thank you in advance!
> 
> Best regards,
> Liang
> 

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------

