[Wien] Problem with parallel jobs of complex structures (supercells) on HPC

Laurence Marks laurence.marks at gmail.com
Fri Jan 24 16:40:44 CET 2025


Sorry, but you have not provided enough information for more than a guess.

Exit code 9 means the OS killed the task, often because it ran out of memory
(oom), but not necessarily. The larger calculation will need about 8*8 times
more memory (perhaps more) than your simple calculation, since the matrix
size grows roughly with the number of atoms and the memory with the square
of the matrix size: do "grep "Matrix size" *output1* -18". You probably ran
out of memory, and will need to use more mpi cores per k-point for the
larger calculation.
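
As a rough sanity check (a sketch only; the N below is a placeholder, take
the real value from the grep above, and the estimate only counts the complex
H and S matrices of lapw1):

  N=30000                        # example matrix dimension, NOT your value
  echo "scale=1; 2*$N*$N*16/2^30" | bc
  # prints the approximate GiB needed per k-point for two complex N x N
  # matrices (16 bytes per element); every k-point job running at the same
  # time on a node needs this much out of that node's RAM.

If that number times the number of simultaneous k-point jobs on a node
approaches the node's RAM, the oom killer is the most likely source of
exit code 9.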

N.B., using 2 OpenMP threads per mpi task is also useful for reducing the
total memory usage. Combine this with mpi.
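
For illustration only (node names and core counts are placeholders, n052 is
invented, and omp_global needs a reasonably recent WIEN2k version), a
two-node .machines file along these lines would run 6 k-point jobs with 8
mpi cores and 2 OpenMP threads each, i.e. still 96 cores in total but with
far fewer simultaneous matrices per node than 24 jobs of 4:

  granularity:1
  1:n052:8
  1:n052:8
  1:n052:8
  1:n053:8
  1:n053:8
  1:n053:8
  omp_global:2
  lapw0: n052:24 n053:24

Check the parallel-execution section of the user's guide for the exact
syntax of your release, and make sure the slurm job actually reserves all
96 cores.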


---
Emeritus Professor Laurence Marks (Laurie)
www.numis.northwestern.edu
https://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en
"Research is to see what everybody else has seen, and to think what nobody
else has thought" Albert Szent-Györgyi

On Fri, Jan 24, 2025, 07:46 Sergeev Gregory <sgregory at live.ru> wrote:

> Dear developers,
> I run my calculations on an HPC cluster with SLURM and I see strange
> behaviour of parallel WIEN2k jobs:
>
> I have two structures:
> 1. A structure with 8 atoms in the unit cell (simple structure)
> 2. A supercell structure with 64 atoms (a 2*2*2 supercell) based on the
> cell of the simple structure
>
> I try to run WIEN2k calculations in parallel mode with two configurations:
> 1. Calculation on 1 node (one node has 48 processors) with 12 parallel jobs
> of 4 processors each (one node job)
> 2. Calculation on 2 nodes (two nodes have 48*2=96 processors) with 24
> parallel jobs of 4 processors each (two node job)
>
> For "simple structure" "one node job" and "two node job" work without
> problems.
>
> For "supercell structure" "one node job" works well, but "two node job"
> crashs with errors in .time1_* files (I use Intel MPI):
>
> -----------------
> n053 n053 n053 n053(21)
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 21859 RUNNING AT n053
> =   EXIT CODE: 9
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> ===================================================================================
>    Intel(R) MPI Library troubleshooting guide:
>       https://software.intel.com/node/561764
>
> ===================================================================================
> 0.042u 0.144s 2:45.42 0.1% 0+0k 4064+8io 60pf+0w
> -----------------
>
> At first I thought there was a problem with insufficient memory in the "two
> node job" (but why, if the "one node job" works with the same number of
> processors per parallel job?). I tried to double the memory per task
> (#SBATCH --cpus-per-task 2), but this did not solve the problem. Same error.
>
> Any ideas why this strange behaviour occurs?
> Does WIEN2k have problems scaling to multiple nodes?
>
> I would appreciate your help. I want to speed up calculations for complex
> structures and I have the resources, but I cannot get it to work.
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>