[Wien] Problem with parallel jobs of complex structures (supercells) on HPC

Sergeev Gregory sgregory at live.ru
Fri Jan 24 14:36:29 CET 2025


Dear developers,
I run my calculations on an HPC cluster with the SLURM scheduler, and I see strange behaviour of parallel WIEN2k jobs:

I have two structures:
1. A structure with 8 atoms in the unit cell ("simple structure")
2. A supercell structure with 64 atoms (a 2*2*2 supercell built from the simple structure's cell)

I run WIEN2k calculations in parallel mode with two configurations (see the sketch after this list):
1. Calculation on 1 node (each node has 48 processors) with 12 parallel jobs, 4 processors per job ("one node job")
2. Calculation on 2 nodes (48*2 = 96 processors) with 24 parallel jobs, 4 processors per job ("two node job")
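
To make the setup concrete, this is roughly how the .machines file is generated in my batch script (a simplified sketch, not my exact script; the hostnames come from $SLURM_JOB_NODELIST and the counts match the layout above):

-----------------
# Sketch: 12 k-point parallel jobs per 48-core node, 4 MPI cores each
# (so 12 jobs in the one-node case, 24 jobs in the two-node case).
rm -f .machines
for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
  for i in $(seq 1 12); do
    echo "1:$host:4" >> .machines   # weight 1, 4 MPI cores on $host
  done
done
echo "granularity:1" >> .machines
echo "extrafine:1"   >> .machines
-----------------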

For "simple structure" "one node job" and "two node job" work without problems.

For "supercell structure" "one node job" works well, but "two node job" crashs with errors in .time1_* files (I use Intel MPI):

-----------------
n053 n053 n053 n053(21)
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 21859 RUNNING AT n053
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
0.042u 0.144s 2:45.42 0.1%    0+0k 4064+8io 60pf+0w
-----------------
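
If I read it correctly, "EXIT CODE: 9" means the processes were killed with SIGKILL, which usually comes from the kernel OOM killer or from SLURM enforcing a memory limit. I assume I can check this with something like the following (<jobid> is a placeholder):

-----------------
# did SLURM record the memory use / exit state of the failed job?
sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,ReqMem
# on the failing node (n053), look for OOM-killer messages:
dmesg -T | grep -i 'killed process'
-----------------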

At first I thought there was insufficient memory in the "two node job" (but why, if the "one node job" works with the same number of processors per parallel job?). I tried to double the memory per task (#SBATCH --cpus-per-task 2), but this did not solve the problem. Same error.
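
As far as I understand, --cpus-per-task only increases the memory available to a task when the cluster allocates memory per CPU, so perhaps I should request it explicitly. Would something like this sketch be the right direction (the 4G value is just a guess, not my real setting)?

-----------------
#SBATCH --nodes=2
#SBATCH --ntasks=24            # 24 parallel jobs
#SBATCH --cpus-per-task=4      # 4 processors per job
#SBATCH --mem-per-cpu=4G       # guessed value; requests memory explicitly
-----------------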

Any ideas why this strange behaviour occurs?
Does WIEN2k have problems scaling to multiple nodes?

I would appreciate your help. I want to speed up calculations for complex structures and I have the resources, but so far I cannot make it work.

