[Wien] Problem with parallel jobs of complex structures (supercells) on HPC
Peter Blaha
peter.blaha at tuwien.ac.at
Fri Jan 24 18:52:36 CET 2025
Check
$WIENROOT/WIEN2k_parallel_options
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
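
With MPI_REMOTE 0 the mpirun command is started on the local node and the
generated .machine* file tells it where to place the ranks; with MPI_REMOTE 1
it is first sent to the remote node via ssh. So one 4-core lapw1 job on n053
should expand to roughly (def/machine file names just as an example):

mpirun -np 4 -machinefile .machine1 $WIENROOT/lapw1_mpi lapw1_1.def
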
Is your MPI_REMOTE set to zero or one?
And USE_REMOTE?
Can you do k-parallel only (no mpi) on 2 nodes?
You did not show the .machines file. Is it ok?
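Just for a test, a .machines file for pure k-point parallelization on 2 nodes
would look roughly like this (n052/n053 are placeholder hostnames, one line
per k-parallel job):

1:n052
1:n052
...
1:n053
granularity:1
extrafine:1

and for your mixed k-point + mpi setup (24 jobs with 4 cores each) the lines
would read 1:n052:4 / 1:n053:4 instead.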
And maybe show the beginning of your job script; maybe some slurm parameters
are not set properly for the 2-node job?
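For 2 nodes the beginning of such a script typically contains something like
(only a sketch, the actual values depend on your cluster):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --exclusive
set hosts = `scontrol show hostnames $SLURM_JOB_NODELIST`
# build .machines from $hosts: one "1:host:4" line per parallel job

If .machines lists only the first node, all 24 jobs pile up on n053 and the
node runs out of memory, which could explain the EXIT CODE 9 (killed) showing
up only in the 2-node case.
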
On 24.01.2025 at 14:36, Sergeev Gregory wrote:
> Dear developers,
> I run my calculations on an HPC cluster with SLURM, and I see strange
> behaviour of parallel WIEN2k jobs:
>
> I have two structures:
> 1. A structure with 8 atoms in the unit cell ("simple structure")
> 2. A 2*2*2 supercell with 64 atoms ("supercell structure") based on the
> cell of the simple structure
>
> I try to run WIEN2k calculations in parallel mode with two configurations:
> 1. Calculations on 1 node (one node has 48 processors) with 12 parallel
> jobs of 4 processors each ("one node job")
> 2. Calculations on 2 nodes (2*48 = 96 processors) with 24 parallel
> jobs of 4 processors each ("two node job")
>
> For "simple structure" "one node job" and "two node job" work without
> problems.
>
> For "supercell structure" "one node job" works well, but "two node job"
> crashs with errors in .time1_* files (I use Intel MPI):
>
> -----------------
> n053 n053 n053 n053(21)
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 21859 RUNNING AT n053
> = EXIT CODE: 9
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 21859 RUNNING AT n053
> = EXIT CODE: 9
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> Intel(R) MPI Library troubleshooting guide:
> https://software.intel.com/node/561764
> ===================================================================================
> 0.042u 0.144s 2:45.42 0.1% 0+0k 4064+8io 60pf+0w
> -----------------
>
> At first I thought that there was insufficient memory for the "2 node
> job" (but why, if the "1 node job" works with the same number of processors
> per parallel job?). I tried to double the memory per task (#SBATCH
> --cpus-per-task 2), but this did not solve the problem. Same error.
>
> Any ideas why this strange behaviour occurs?
> Does WIEN2k have problems scaling to multiple nodes?
>
> I would appreciate your help. I want to speed up calculations for
> complex structures and I have the resources, but I cannot get it to work.
>
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
--
-----------------------------------------------------------------------
Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.blaha at tuwien.ac.at
WWW: http://www.imc.tuwien.ac.at WIEN2k: http://www.wien2k.at
-------------------------------------------------------------------------