[Wien] Problem with k-parallel in version 24.1?
Peter Blaha
peter.blaha at tuwien.ac.at
Wed Oct 16 09:01:34 CEST 2024
Yes, this is still an overloading problem. For instance, when very many
programs happen to read/write at exactly the same time, your system may
not have enough clients to handle this.
A much better setup is to set OMP_NUM_THREADS to 2 or 4 (try both and
check (i) stability and (ii) timing) and use only 32 (or 16) k-parallel jobs.
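As a rough illustration of the arithmetic (a sketch, not part of WIEN2k): on 64 cores, 2 threads per job leave room for 32 k-parallel jobs, and 4 threads leave room for 16. The helper below, with the hypothetical name `machines_lines`, builds a `.machines`-style layout for such a split. The `omp_global:` line and the `1:localhost` entries are assumptions about the file format; check them against the user's guide of your WIEN2k version before use.

```python
def machines_lines(total_cores, omp_threads, host="localhost"):
    """Return .machines-style lines for k-parallel jobs with OpenMP threads.

    Assumption: one '1:<host>' line per k-parallel job and an
    'omp_global:<n>' line for the thread count. Verify both against
    the WIEN2k user's guide for your version.
    """
    if omp_threads < 1 or total_cores % omp_threads != 0:
        raise ValueError("omp_threads must divide total_cores evenly")
    n_jobs = total_cores // omp_threads  # cores are shared: jobs x threads
    lines = [f"omp_global:{omp_threads}"]
    lines += [f"1:{host}"] * n_jobs
    lines += ["granularity:1", "extrafine:1"]
    return lines


if __name__ == "__main__":
    for threads in (2, 4):
        lines = machines_lines(64, threads)
        n_jobs = sum(1 for line in lines if line.startswith("1:"))
        print(f"{threads} threads -> {n_jobs} k-parallel jobs")
```

Running both variants and comparing wall time per SCF cycle, plus how often the run fails, gives the stability and timing check suggested above.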
In addition, the lapw1/2/..para scripts contain some delays (search for
DELAY and SLEEPY), which could be increased. I am not sure whether this
helps in your case.
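Before tuning the delays, it may help to confirm which per-job files are actually missing at the moment sumpara would start. The following is a minimal sketch, assuming the files are named `<case>.scf2up_<n>` and `<case>.scf2dn_<n>` with n running from 1 to the number of k-parallel jobs, as in the error message quoted below. `missing_scf2_files` and `wait_for_files` are hypothetical helpers, not WIEN2k tools.

```python
import os
import time


def missing_scf2_files(case_dir, case, n_jobs, spins=("up", "dn")):
    """List expected per-job scf2 files that do not exist (yet)."""
    expected = [
        os.path.join(case_dir, f"{case}.scf2{spin}_{i}")
        for spin in spins
        for i in range(1, n_jobs + 1)
    ]
    return [path for path in expected if not os.path.exists(path)]


def wait_for_files(case_dir, case, n_jobs, timeout=60.0, poll=1.0):
    """Poll until nothing is missing or the timeout expires.

    Returns the list of files still missing (empty on success).
    """
    deadline = time.monotonic() + timeout
    missing = missing_scf2_files(case_dir, case, n_jobs)
    while missing and time.monotonic() < deadline:
        time.sleep(poll)
        missing = missing_scf2_files(case_dir, case, n_jobs)
    return missing
```

If a file shows up only after a few seconds of polling, slow file-system synchronization is the likely cause and longer delays may help. If it never shows up, the corresponding lapw2 job probably failed and its own error output is the place to look.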
PS: Previously, OMP with 2 cores was very efficient, but more cores did
not help. Recently this seems to have changed, and 4 (or possibly even 8)
cores can be quite efficient.
On 16.10.2024 at 00:49, Yichen Zhang wrote:
> Dear WIEN2k developers and users,
>
> I'm running WIEN2k 24.1 on a SLURM cluster. In this case, only
> k-parallel is used (no OpenMP or MPI). I typically divide the klist into
> 64 groups on 64 cores for this set of calculations. Hyperthreading is
> turned off.
>
> I encounter this error from time to time. Sometimes all SCF cycles
> finish successfully, but there is maybe a 20-40% chance that the
> SCF stops at sumpara in some cycle after lapw2. Restarting the SCF may
> then run fine until convergence, or hit the same problem again in a
> later cycle. The error is that a file case.scf2up_XX or case.scf2dn_XX
> is not found, with XX between 1 and 64 when 64 k-parallel jobs are
> used.
>
> One example of such error in slurm standard output is:
>
> forrtl: No such file or directory
> forrtl: severe (29): file not found, unit 21, file /scratch/yz155/UUD_U6p25eV/UUD_U6p25eV.scf2dn_62
> Image              PC                Routine            Line        Source
> sumpara            000000000042876C  Unknown            Unknown     Unknown
> sumpara            000000000041303A  scfsum_            128         scfsum.f
> sumpara            0000000000410F92  MAIN__             242         sumpara.F
> sumpara            000000000040434D  Unknown            Unknown     Unknown
> libc.so.6          000014D975829590  Unknown            Unknown     Unknown
> libc.so.6          000014D975829640  __libc_start_main  Unknown     Unknown
> sumpara            0000000000404265  Unknown            Unknown     Unknown
> cp: cannot stat '.in.tmp': No such file or directory
> grep: No match.
>
>
> > stop error
>
>
> The missing scf2 file sometimes comes from scf2up or sometimes from
> scf2dn. The "62" seems random among k-parallel numbers.
>
>
> I noticed a previous thread in 2016 when Maciej Polak asked about
> "Problem with k-parallel", but I guess much has been updated since then.
>
>
> Does it still come from slow I/O? I already run it in /scratch on the
> cluster which has the fastest I/O. What are some insights and
> suggestions? Thank you very much in advance.
>
>
> Best regards
>
> Yichen
>
>
> --
> Yichen Zhang
> Department of Physics and Astronomy
> Rice University
> 6100 Main St., Houston, TX 77005-1892
> Email: zycforphysics at gmail.com
>
--
-----------------------------------------------------------------------
Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.blaha at tuwien.ac.at
WWW: http://www.imc.tuwien.ac.at WIEN2k: http://www.wien2k.at
-------------------------------------------------------------------------