[Wien] Problem with parallel jobs of complex structures (supercells) on HPC

Peter Blaha peter.blaha at tuwien.ac.at
Fri Jan 24 18:52:36 CET 2025


Check $WIENROOT/WIEN2k_parallel_options:

setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"

Is your MPI_REMOTE set to zero or one? And what about USE_REMOTE?
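
With USE_REMOTE=1 the k-parallel jobs on the second node are started via the
remote shell chosen in siteconfig (usually ssh), so passwordless ssh between
the allocated nodes must work. A quick sanity check from inside a 2-node job
could be (n053 is taken from your output, n054 is only a placeholder for your
second node):

ssh n053 hostname
ssh n054 hostname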

Can you do k-parallel only (no MPI) on the 2 nodes?

You did not show the .machines file. Is it OK?
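
For your 2-node setup (24 k-parallel jobs with 4 mpi-cores each) it should
contain one line per parallel job of the form  speed:host:cores , i.e. 12
lines with  1:n053:4  for the first node and 12 for the second one. A
shortened sketch (n054 is again only a placeholder for your second node):

1:n053:4
1:n053:4
1:n054:4
1:n054:4
granularity:1
extrafine:1

For a pure k-parallel test without mpi simply drop the core count, i.e. use
lines like  1:n053 .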

And maybe show the beginning of your job script? Perhaps some SLURM parameters 
are not set properly for the 2-node job.
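
For comparison, a minimal SLURM header for such a 2-node job could look like
this (the job name, time limit and any partition/memory options are
cluster-specific and only placeholders):

#!/bin/bash
#SBATCH --job-name=wien2k_supercell
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=24:00:00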

On 24.01.2025 at 14:36, Sergeev Gregory wrote:
> Dear developers,
> I run my calculations on an HPC cluster with the SLURM queueing system, and 
> I see strange behaviour of parallel WIEN2k jobs:
> 
> I have two structures:
> 1. A structure with 8 atoms in the unit cell ("simple structure")
> 2. A 2*2*2 supercell with 64 atoms ("supercell structure"), based on 
> the cell of the simple structure
> 
> I try to run WIEN2k calculations in parallel mode with two configurations:
> 1. Calculations on 1 node (one node has 48 processors) with 12 parallel 
> jobs and 4 processors per job ("one node job")
> 2. Calculations on 2 nodes (2*48 = 96 processors) with 24 parallel 
> jobs and 4 processors per job ("two node job")
> 
> For "simple structure" "one node job" and "two node job" work without 
> problems.
> 
> For "supercell structure" "one node job" works well, but "two node job" 
> crashs with errors in .time1_* files (I use Intel MPI):
> 
> -----------------
> n053 n053 n053 n053(21)
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 21859 RUNNING AT n053
> =   EXIT CODE: 9
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> 
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 21859 RUNNING AT n053
> =   EXIT CODE: 9
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>     Intel(R) MPI Library troubleshooting guide:
>        https://software.intel.com/node/561764
> ===================================================================================
> 0.042u 0.144s 2:45.42 0.1%    0+0k 4064+8io 60pf+0w
> -----------------
> 
> At first I thought that there was a problem with insufficient memory in the 
> "two node job" (but why, if the "one node job" works with the same number 
> of processors per parallel job?). I tried to double the memory per task 
> (#SBATCH --cpus-per-task 2), but this did not solve the problem. Same error.
> 
> Any idea why this strange behaviour occurs?
> Does WIEN2k have problems scaling to multiple nodes?
> 
> I would appreciate your help. I want to speed up calculations for 
> complex structures; I have the resources, but I cannot make it work.

-- 
-----------------------------------------------------------------------
Peter Blaha,  Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.blaha at tuwien.ac.at
WWW:   http://www.imc.tuwien.ac.at      WIEN2k: http://www.wien2k.at
-------------------------------------------------------------------------


