[Wien] Problem with k-parallel in version 24.1?

Peter Blaha peter.blaha at tuwien.ac.at
Wed Oct 16 09:01:34 CEST 2024


Yes, this is still an overloading problem. For instance, when very many 
programs happen to read/write at exactly the same time, your system may 
not have enough clients to handle this.

A much better setup is to set OMP_NUM_THREADS to 2 or 4 (try both and 
check (i) stability and (ii) timing) and to use only 32 (or 16) 
k-parallel jobs, so that the 64 cores are still fully used.
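
As a rough sketch of this bookkeeping (not an official tool), the small 
Python helper below splits a fixed number of cores into k-parallel jobs 
with a given number of OpenMP threads each and builds matching .machines 
lines. The function name machines_lines, the host name "localhost" and 
the use of the omp_global keyword are assumptions made for illustration; 
please check the user's guide of your WIEN2k version before relying on 
them.

```python
def machines_lines(total_cores, omp_threads, host="localhost"):
    """Return .machines lines for k-parallel jobs sharing total_cores."""
    if total_cores % omp_threads != 0:
        raise ValueError("total_cores must be a multiple of omp_threads")
    n_jobs = total_cores // omp_threads
    lines = ["granularity:1"]
    # Assumption: omp_global sets the OpenMP thread count for every job.
    lines.append(f"omp_global:{omp_threads}")
    # One "1:host" line per k-parallel job.
    lines += [f"1:{host}"] * n_jobs
    lines.append("extrafine:1")
    return lines


if __name__ == "__main__":
    for threads in (2, 4):
        jobs = [l for l in machines_lines(64, threads) if l.startswith("1:")]
        print(f"OMP threads {threads}: {len(jobs)} k-parallel jobs")
```

With 64 cores this gives 32 jobs for 2 threads and 16 jobs for 4 
threads, which are the two combinations to compare for stability and 
timing.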

In addition, the lapw1/2/..para scripts contain some delays (search for 
DELAY and SLEEPY), which could be increased. I am not sure whether this 
helps in your case.
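
To locate these settings quickly, something like the following Python 
sketch could be used. It assumes that the delays are set with csh-style 
"set delay = N" / "set sleepy = N" lines; the SAMPLE text is a 
hypothetical excerpt for illustration and is not copied from a real 
WIEN2k script, so the actual spelling in your installation may differ.

```python
import re

# Matches csh assignments such as "set delay = 1" (case-insensitive).
PATTERN = re.compile(r"^\s*set\s+(delay|sleepy)\s*=\s*(\S+)", re.IGNORECASE)


def find_delays(script_text):
    """Return (line_number, name, value) for each delay/sleepy assignment."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        match = PATTERN.match(line)
        if match:
            hits.append((number, match.group(1).lower(), match.group(2)))
    return hits


# Hypothetical excerpt, NOT taken from a real WIEN2k script.
SAMPLE = """#!/bin/csh -f
set delay   = 1
set sleepy  = 1
echo running
"""

if __name__ == "__main__":
    for number, name, value in find_delays(SAMPLE):
        print(f"line {number}: {name} = {value}")
```

To inspect a real script, read its text from your WIEN2k directory and 
pass it to find_delays instead of SAMPLE; the reported line numbers show 
where the values could be increased.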

PS: Previously, OMP with 2 cores was very efficient, but more cores did 
not pay off. Recently this seems to have changed, and 4 (or even 8) 
cores may now be quite efficient.

On 16.10.2024 at 00:49, Yichen Zhang wrote:
> Dear WIEN2k developers and users,
> 
> I'm running WIEN2k 24.1 on a SLURM cluster. In this case, only 
> k-parallel is used (no OMP or MPI). I typically divide the klist into 
> 64 groups on 64 cores for this set of calculations. Hyperthreading is 
> turned off.
> 
> I encounter this error from time to time. Sometimes all SCF cycles 
> finish successfully, but there is maybe a 20-40% chance that the 
> SCF stops at sumpara in one cycle after lapw2. A restarted SCF may 
> run fine until convergence or hit this problem again in a later 
> cycle. The error is that a file case.scf2up/dn_XX is not found, where 
> XX is between 1 and 64 for, for example, 64 k-parallel jobs.
> 
> One example of such error in slurm standard output is:
> 
> forrtl: No such file or directory
> forrtl: severe (29): file not found, unit 21, file /scratch/yz155/UUD_U6p25eV/UUD_U6p25eV.scf2dn_62
> Image       PC                 Routine             Line     Source
> sumpara     000000000042876C   Unknown             Unknown  Unknown
> sumpara     000000000041303A   scfsum_             128      scfsum.f
> sumpara     0000000000410F92   MAIN__              242      sumpara.F
> sumpara     000000000040434D   Unknown             Unknown  Unknown
> libc.so.6   000014D975829590   Unknown             Unknown  Unknown
> libc.so.6   000014D975829640   __libc_start_main   Unknown  Unknown
> sumpara     0000000000404265   Unknown             Unknown  Unknown
> cp: cannot stat '.in.tmp': No such file or directory
> grep: No match.
> 
> 
>  > stop error
> 
> 
> The missing scf2 file sometimes comes from scf2up or sometimes from 
> scf2dn. The "62" seems random among k-parallel numbers.
> 
> 
> I noticed a previous thread in 2016 when Maciej Polak asked about 
> "Problem with k-parallel", but I guess much has been updated since then.
> 
> 
> Does it still come from slow I/O? I already run it in /scratch, which 
> has the fastest I/O on the cluster. Do you have any insights or 
> suggestions? Thank you very much in advance.
> 
> 
> Best regards
> 
> Yichen
> 
> 
> -- 
> Yichen Zhang
> Department of Physics and Astronomy
> Rice University
> 6100 Main St., Houston, TX 77005-1892
> Email: zycforphysics at gmail.com <mailto:zycforphysics at gmail.com>
> 

-- 
-----------------------------------------------------------------------
Peter Blaha,  Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.blaha at tuwien.ac.at
WWW:   http://www.imc.tuwien.ac.at      WIEN2k: http://www.wien2k.at
-------------------------------------------------------------------------


