[Wien] Problem with parallel jobs of complex structures (supercells) on HPC
Gavin Abo
gabo13279 at gmail.com
Fri Jan 24 17:50:38 CET 2025
There is a list of potential exit code 9 (KILLED BY SIGNAL: 9) causes at [1].
Hitting the walltime limit (--time [2,3]) is listed as one of them.
The Slurm seff command might be helpful for determining whether the job was
killed by oom (out of memory). Refer to [4,5].
[1] https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-6/error-message-bad-termination.html
[2] https://docs.hpc.uwec.edu/slurm/determining-resources/#time-walltime
[3] https://hpcc.umd.edu/hpcc/help/jobs.html#walltime
[4] https://www.nsc.liu.se/support/memory-management/
[5] https://documentation.sigma2.no/jobs/choosing-memory-settings.html
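
For example (the job ID below is just a placeholder), after the job has
ended you could run:

seff <jobid>
sacct -j <jobid> --format=JobID,State,Elapsed,Timelimit,ReqMem,MaxRSS

If MaxRSS is close to ReqMem, or the job state shows OUT_OF_MEMORY, the
SIGKILL most likely came from the oom killer rather than the walltime limit.
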
Hope that can help,
Gavin
WIEN2k user
On 1/24/2025 8:40 AM, Laurence Marks wrote:
> Sorry, but you have not provided enough information for more than a guess.
>
> Exit code 9 means the OS killed the task, often because it ran out of
> memory (oom), though not necessarily. The larger calculation will require
> about 8*8 times more memory (perhaps more) than your simple calculation:
> do 'grep "Matrix size" *output1* -18'. You probably ran out of memory
> and will need to use more mpi per k-point for the larger calculation.
>
> N.B., using 2 OpenMP (omp) threads per task is also useful for reducing
> the total memory usage. Combine this with mpi.
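>
> A rough, untested sketch of a .machines file combining mpi and omp (the
> host names and core counts are only placeholders, and omp_global needs a
> recent WIEN2k version; please check the parallel setup section of the
> user's guide for the exact syntax):
>
> # each line = one k-point group, running mpi-parallel on 8 cores
> 1:n052:8
> 1:n052:8
> 1:n053:8
> 1:n053:8
> granularity:1
> # 2 OpenMP threads (sets OMP_NUM_THREADS=2), as suggested above
> omp_global:2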
>
>
> ---
> Emeritus Professor Laurence Marks (Laurie)
> www.numis.northwestern.edu
> https://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought" Albert Szent-Györgyi
>
> On Fri, Jan 24, 2025, 07:46 Sergeev Gregory <sgregory at live.ru> wrote:
>
> Dear developers,
> I run my calculations on an HPC cluster with the Slurm system, and I
> see strange behaviour of parallel WIEN2k jobs:
>
> I have two structures:
> 1. A structure with 8 atoms in the unit cell ("simple structure")
> 2. A 2*2*2 supercell with 64 atoms ("supercell structure"), based on
> the cell of the simple structure
>
> I try to run WIEN2k calculations in parallel mode with two configurations:
> 1. A calculation on 1 node (one node has 48 processors) with 12
> parallel jobs of 4 processors each ("one node job")
> 2. A calculation on 2 nodes (2*48 = 96 processors) with 24
> parallel jobs of 4 processors each ("two node job")
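>
> Roughly, the batch request for the two node case is something like the
> following (only a sketch, the exact directives depend on the cluster):
>
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=48
> # 2*48 = 96 cores, split in .machines into 24 parallel jobs of 4 cores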
>
> For "simple structure" "one node job" and "two node job" work
> without problems.
>
> For "supercell structure" "one node job" works well, but "two node
> job" crashs with errors in .time1_* files (I use Intel MPI):
>
> -----------------
> n053 n053 n053 n053(21)
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 21859 RUNNING AT n053
> = EXIT CODE: 9
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 21859 RUNNING AT n053
> = EXIT CODE: 9
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> Intel(R) MPI Library troubleshooting guide:
> https://software.intel.com/node/561764
> ===================================================================================
> 0.042u 0.144s 2:45.42 0.1% 0+0k 4064+8io 60pf+0w
> -----------------
>
> At first I thought there was insufficient memory for the "2 node job"
> (but why, if the "1 node job" works with the same number of processors
> per parallel job?). I tried to double the memory used per task
> (#SBATCH --cpus-per-task 2), but this did not solve the problem.
> Same error.
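>
> For reference, memory itself would be requested with something like the
> following (the values are only placeholders and depend on the cluster's
> limits):
>
> #SBATCH --mem=180G
> # or, per allocated core:
> #SBATCH --mem-per-cpu=4G
>
> As far as I understand, --cpus-per-task mainly adds cores and only adds
> memory when the allocation is made per cpu.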
>
> Any ideas why this strange behavior occurs?
> Does WIEN2k have problems scaling to multiple nodes?
>
> I would appreciate your help. I want to speed up calculations for
> complex structures and I have the resources, but I cannot get it to work.
>