[Wien] Problems of out of memory in parallel jobs

Laurence Marks laurence.marks at gmail.com
Tue May 12 01:37:00 CEST 2020


I suggest that you talk to a sysadmin to get some clarification. In
particular, see if this is just memory, or a combination of memory and file
space. From what I can see it is probably memory, but there seems to be
some flexibility in how it is configured.

One other possibility is a memory leak. Which MPI are you using?

N.B., I would be a bit concerned that srun is not working for you. Talk to
a sysadmin; you might be running outside/around your memory allocation.
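To illustrate the concern: when srun is bypassed, the mpirun-launched ranks still live inside the batch step's cgroup, so the total resident memory of all ranks must fit within what the job requested. A minimal sketch of a batch header that makes the request explicit (the job name, task count, and memory sizes here are placeholders, not values from the original post):

```shell
#!/bin/bash
# Hypothetical Slurm header -- all values are illustrative.
#SBATCH --job-name=lapw1_dos
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=6G   # cgroup limit becomes ntasks * 6G on the node
# Even with WIEN_MPIRUN's srun line commented out, these mpirun ranks
# are children of the batch step, so the cgroup out-of-memory handler
# fires when their combined resident memory exceeds the requested total.
mpirun -np $SLURM_NTASKS $WIENROOT/lapw1_mpi lapw1.def
```

Whether --mem or --mem-per-cpu is the right knob depends on how the site configured the cgroup plugin, which is exactly the kind of detail a sysadmin can clarify.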

Two relevant sources:
https://community.pivotal.io/s/article/the-application-crashes-with-the-message-cgroup-out-of-memory?language=en_US

https://bugs.schedmd.com/show_bug.cgi?id=2614

On Mon, May 11, 2020 at 10:55 AM MA Weiliang <weiliang.MA at etu.univ-amu.fr>
wrote:

> Dear Wien users,
>
> The WIEN2k 18.2 I use was compiled on a shared-memory cluster with Intel
> compiler 2019, MKL 2019, and IMPI 2019. Because 'srun' does not give a
> correct parallel calculation on this system, I commented out the line
> setenv WIEN_MPIRUN "srun -K -N_nodes_ -n_NP_ -r_offset_ _PINNING_ _EXEC_"
> in the parallel_options file and used the second choice
> mpirun="mpirun -np _NP_ _EXEC_".
>
> Parallel jobs run well in the SCF cycles. But when I increase the number
> of k-points (to about 5000) to calculate the DOS, lapw1 crashes halfway
> with the cgroup out-of-memory handler. That is very strange: with the
> same parameters, the job runs fine on a single core.
>
> A similar problem occurs at the nlvdw_mpi step. I also increased the
> memory up to 50 GB for this cell of fewer than 10 atoms, but it still
> did not work.
>
> [Parallel job output:]
> starting parallel lapw1 at lun. mai 11 16:24:48 CEST 2020
> ->  starting parallel LAPW1 jobs at lun. mai 11 16:24:48 CEST 2020
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 12604
> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>      lame25 lame25 lame25 lame25 lame25 lame25 lame25 lame25(5038)
> 4641.609u 123.862s 10:00.69 793.3%   0+0k 489064+2505080io 7642pf+0w
>    Summary of lapw1para:
>    lame25        k=0     user=0  wallclock=0
> **  LAPW1 crashed!
> 4643.674u 126.539s 10:03.50 790.4%      0+0k 490512+2507712io 7658pf+0w
> error: command   /home/mcsete/work/wma/Package/wien2k.18n/lapw1para
> lapw1.def   failed
> slurmstepd: error: Detected 1 oom-kill event(s) in step 86112.batch
> cgroup. Some of your processes may have been killed by the cgroup
> out-of-memory handler.
>
> [Single mode output:]
>  LAPW1 END
> 11651.205u 178.664s 3:23:49.07 96.7%    0+0k 19808+22433688io 26pf+0w
>
> Do you have any ideas? Thank you in advance!
>
> Best regards,
> Liang
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>


-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
Corrosion in 4D: www.numis.northwestern.edu/MURI
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi

