[Wien] Problem with parallel jobs of complex structures (supercells) on HPC

Gavin Abo gabo13279 at gmail.com
Fri Jan 24 17:50:38 CET 2025


There is a list of potential causes of exit code 9 (KILLED BY SIGNAL: 9) 
at [1].

Hitting the walltime (--time [2,3]) limit is listed as one of them.
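
If the jobs die at roughly the requested time limit, raising it in the 
batch script is the fix; a minimal sketch (the 24-hour value is only an 
example, adjust it to your cluster's limits):

   #SBATCH --time=24:00:00   # walltime in HH:MM:SS; the job is killed
                             # once this limit is exceeded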

The Slurm seff command might be helpful for determining whether it was 
caused by OOM (out of memory).  Refer to [4,5].
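
For example (the job ID below is just a placeholder), after the job has 
finished:

   seff 1234567
   sacct -j 1234567 --format=JobID,State,Elapsed,MaxRSS

In the seff output, a "Memory Utilized" value close to the requested 
memory (and, on newer Slurm versions, an OUT_OF_MEMORY state from sacct) 
usually points to the OOM killer.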

[1] https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-6/error-message-bad-termination.html
[2] https://docs.hpc.uwec.edu/slurm/determining-resources/#time-walltime
[3] https://hpcc.umd.edu/hpcc/help/jobs.html#walltime
[4] https://www.nsc.liu.se/support/memory-management/
[5] https://documentation.sigma2.no/jobs/choosing-memory-settings.html

Hope that can help,
Gavin
WIEN2k user

On 1/24/2025 8:40 AM, Laurence Marks wrote:
> Sorry, but you have not provided enough information for more than a guess.
>
> Exit code 9 is when the OS kills the task, often from out of memory 
> (oom) but it does not have to be. The larger calculation will require 
> about 8*8 more memory (perhaps more) than your simple calculation: do 
> grep "Matrix size" *output1* -18. You probably ran out of memory, 
> and will need to use more mpi/kpt for the larger calculation.
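
To make the arithmetic behind the 8*8 explicit (the matrix sizes below 
are made-up numbers, only for illustration):

   grep "Matrix size" *output1*
   #  8-atom cell : Matrix size  3000   (hypothetical)
   # 64-atom cell : Matrix size 24000   (~8x larger dimension)
   # the dense H and S matrices scale with the square of this dimension,
   # so memory per k-point grows by roughly 8*8 = 64x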
>
> N.B., using 2 OpenMP (omp) threads per task is also useful in reducing 
> the total memory usage. Combine this with mpi.
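
A minimal sketch of what combining the two could look like for one 4-core 
parallel job (n053 is taken from the log below; whether OMP_NUM_THREADS 
is honored depends on the WIEN2k build, and the exact way to set omp may 
differ between versions):

   export OMP_NUM_THREADS=2     # 2 OpenMP threads per MPI rank
   # corresponding .machines line: 2 MPI ranks instead of 4 on 4 cores
   # 1:n053 n053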
>
>
> ---
> Emeritus Professor Laurence Marks (Laurie)
> www.numis.northwestern.edu
> https://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en
> "Research is to see what everybody else has seen, and to think what 
> nobody else has thought" Albert Szent-Györgyi
>
> On Fri, Jan 24, 2025, 07:46 Sergeev Gregory <sgregory at live.ru> wrote:
>
>     Dear developers,
>     I run my calculations on an HPC cluster with the Slurm system, and
>     I see strange behaviour of parallel WIEN2k jobs:
>
>     I have two structures:
>     1. A structure with 8 atoms in the unit cell ("simple structure")
>     2. A 2*2*2 supercell with 64 atoms ("supercell structure"), based
>     on the cell of the simple structure
>
>     I run WIEN2k calculations in parallel mode with two configurations:
>     1. Calculations on 1 node (48 processors per node) with 12 parallel
>     jobs, 4 processors per job ("one node job")
>     2. Calculations on 2 nodes (2*48 = 96 processors) with 24 parallel
>     jobs, 4 processors per job ("two node job")
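
For reference, with that layout the .machines file contains one line per 
parallel job, each listing the 4 cores assigned to it. A sketch for the 
two-node case (n053 appears in the log below, n054 is a made-up second 
node name; the real file is normally generated from $SLURM_JOB_NODELIST 
by the job script):

   1:n053 n053 n053 n053
   1:n053 n053 n053 n053
   ...                       (12 such lines for n053, 12 for n054)
   1:n054 n054 n054 n054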
>
>     For the "simple structure", both the "one node job" and the "two
>     node job" work without problems.
>
>     For the "supercell structure", the "one node job" works well, but
>     the "two node job" crashes with errors in the .time1_* files (I use
>     Intel MPI):
>
>     -----------------
>     n053 n053 n053 n053(21)
>     ===================================================================================
>     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>     =   PID 21859 RUNNING AT n053
>     =   EXIT CODE: 9
>     =   CLEANING UP REMAINING PROCESSES
>     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>     ===================================================================================
>
>     ===================================================================================
>     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>     =   PID 21859 RUNNING AT n053
>     =   EXIT CODE: 9
>     =   CLEANING UP REMAINING PROCESSES
>     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>     ===================================================================================
>        Intel(R) MPI Library troubleshooting guide:
>     https://software.intel.com/node/561764
>     ===================================================================================
>     0.042u 0.144s 2:45.42 0.1%    0+0k 4064+8io 60pf+0w
>     -----------------
>
>     At first I thought there was a problem with insufficient memory in
>     the "two node job" (but why, if the "one node job" works with the
>     same number of processors per parallel job?). I tried to double the
>     memory available per task (#SBATCH --cpus-per-task 2), but this did
>     not solve the problem. Same error.
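
Note that on many clusters --cpus-per-task only adds memory when memory 
is allocated per CPU (--mem-per-cpu, or a per-core DefMemPerCPU default); 
otherwise the limit has to be raised explicitly. A sketch with 
illustrative values (use one of the two, not both):

   #SBATCH --mem=180G            # memory per node
   #SBATCH --mem-per-cpu=4G      # or: memory per allocated core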
>
>     Any ideas why this strange behaviour occurs?
>     Does WIEN2k have problems scaling to multiple nodes?
>
>     I would appreciate your help. I want to speed up calculations for
>     complex structures, and I have the resources, but I cannot get it
>     to work.
>

