[Wien] speedup with mpi

Fri Nov 15 12:57:53 CET 2024

Dear Prof. Blaha,

Thank you for the quick comments!

My recent case has 50 atoms and a klist of 169 k-points in the SCF, but 
normally I have fewer atoms. I am not sure if I can reduce the klist, as 
I am interested in SOC calculations and band structure -- this can be 
tested of course.
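
As a rough illustration of why the klist size matters for k-parallel 
runs, here is a minimal Python sketch. It assumes that every k-point 
costs about the same and that jobs run in lock-step batches, which is a 
simplification; the function name and the job counts are made up for 
illustration and are not part of WIEN2k.

```python
def kparallel_rounds(n_kpoints, n_jobs):
    """Sequential batches needed to process n_kpoints with n_jobs parallel jobs.

    Assumption (hypothetical model): all k-points take equal time,
    so the wall time is proportional to the number of batches.
    """
    if n_kpoints < 1 or n_jobs < 1:
        raise ValueError("n_kpoints and n_jobs must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return (n_kpoints + n_jobs - 1) // n_jobs


if __name__ == "__main__":
    # 169 k-points as in my case; the job counts are arbitrary examples.
    for jobs in (8, 16, 64, 169):
        print(jobs, "jobs ->", kparallel_rounds(169, jobs), "rounds")
```

In this picture, adding k-parallel jobs beyond the number of k-points 
gains nothing, and any further speedup would have to come from mpi or 
OMP within a single k-point.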

Assuming a single k-point: how much faster can mpi on the cluster really 
be (assuming perfect bandwidth)?

Would it be e.g. 20x faster for a 20-atom slab? Does it correlate with 
the number of atoms in the unit cell?

Will the mpi speedup scale like OMP on a single node (up to the RAM and 
inter-node bandwidth limits)?

Is it in general true that for WIEN2k the mpi speedup is roughly the 
same as with OMP, but that mpi also allows distributing jobs to various 
nodes (thus fixing the RAM bandwidth issue)?
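
To make the question concrete, below is a toy Python model of what I 
have in mind: an Amdahl-style estimate in which only a limited number of 
cores per node contribute before memory bandwidth saturates. All 
parameter values (parallel fraction, cores per node, saturation point) 
are hypothetical placeholders, not measured WIEN2k numbers, and network 
and I/O costs are ignored completely.

```python
def toy_speedup(cores, parallel_fraction=0.95,
                cores_per_node=8, saturating_cores_per_node=4):
    """Amdahl-style speedup with a per-node memory-bandwidth cap.

    Hypothetical model: on each node only the first
    `saturating_cores_per_node` cores add throughput; further cores
    on the same node are assumed to be starved by memory bandwidth.
    """
    if cores < 1:
        raise ValueError("cores must be positive")
    full_nodes, rest = divmod(cores, cores_per_node)
    cap = min(cores_per_node, saturating_cores_per_node)
    # Cores that effectively contribute across all (partly) used nodes.
    effective = full_nodes * cap + min(rest, cap)
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / effective)


if __name__ == "__main__":
    for n in (1, 4, 8, 16, 32, 64):
        print(n, "cores ->", round(toy_speedup(n), 2), "x")
```

In this sketch, filling a node beyond the saturation point gives no 
gain, adding nodes does, and the total can never exceed 
1 / (1 - parallel_fraction), i.e. 20x for the placeholder value of 0.95. 
Whether real mpi runs behave like this is exactly what I would like to 
know.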

Yes, I do use a SCRATCH directory on the cluster.

Best,
Lukasz

On 2024-11-15 10:22, Peter Blaha wrote:
>> Does anyone have experience in running WIEN2k on a cluster or 
>> supercomputer with both k-parallel and mpi? My interest is in band 
>> structure calculations of large slabs (e.g. >20 atoms). I would 
>> appreciate any comment/remark.
> 
> What you call "large slab" (20 atoms), I'd call "small calculation".
> 
> What makes your calculations expensive is the tremendous number of
> k-points you are using in general. Do you do the scf cycle also with
> this huge number of k-points?
> 
> On a single node, with limited memory bandwidth, mpi will NOT help. It
> suffers from the same memory bandwidth limit.
> For your "small" calculations you can probably use a few nodes on a
> supercomputer (certainly not too many) and couple them via mpi to get
> a single k-point done a bit faster. This depends on the hardware
> (network speed and I/O ) and again memory bandwidth. Note: ELPA is
> mandatory for fast mpi-calculations !!
> 
> At http://www.wien2k.at/reg_user/benchmark/ you can find my benchmarks
> for the I9-14900K. It shows the identical limitations due to memory.
> In fact I often use OMP=8 and only 1 k-job on such machines (or OMP=4
> and 2 k-parallel jobs - not much difference). Therefore my
> recommendation for new PCs would be to use a processor with fewer
> cores, but maybe buy more of them and couple them for k-parallel.
> 
> The memory bandwidth problem is related to all "linear algebra" tasks,
> i.e. the matrix diagonalization.
> 
> For sure, a "good supercomputer" should give you overall a better
> performance, but for such "small cases", don't expect too much. While
> the memory bandwidth is often less problematic with Xeon type (or AMD)
> cores, most supercomputers suffer either from network or I/O
> limitations. And k-parallel jobs are quite I/O intensive (I hope you
> ALWAYS use a local SCRATCH directory ?).
> Also note: the single core performance is usually SLOWER than what you
> can get on an I9-14900K PC.
> 
> The real benefit of a supercomputer + mpi is its "unlimited" memory
> (and sometimes, that it does not cost you any real money). You can do
> unit cells with several hundreds of atoms ....
> 
> The passwordless ssh should not be a problem on a reasonable slurm
> machine - the problem is the slow network leading to timeouts in ssh
> connections....
> 
>> - I am wondering if there is a realistic speedup when using mpi? Can I 
>> have e.g. 10x speedup only from mpi, compared to single core? Will 
>> speedup then multiply with k-parallel?
>> 
>> - Does mpi on a single node also suffer from the memory bandwidth 
>> (related to the number of memory channels on the chipset/mainboard)?
>> 
>> - Has anyone been able to find a workaround for the passwordless ssh 
>> for running on a cluster/supercomputer?
>> 
>> - Is the memory bandwidth problem intrinsic to LAPW, or specific to 
>> WIEN2k? With two memory channels on desktop machines, the k-parallel 
>> speedup is only up to something like 4 cores (depending on OMP a bit). 
>> Actually WIEN2k speed does increase quite a lot with the RAM speed 
>> (e.g. DDR5 7200) -- this is the case on i9-14900 that I am using in 
>> the office.
>> 
>> For a few years I have been using an older slurm cluster, and jobs 
>> typically crash when using more than 8 or 10 nodes with k-parallel 
>> (also with small OMP, I think this has been discussed in the mailing 
>> list). In my case each node has 8 cores and 4 memory channels, so it 
>> can do 8 k-parallel jobs with practically linear speedup. I would say 
>> that in general the extra speedup with OMP is not significant, and the 
>> real speedup is only with k-parallel (but as mentioned it is limited by 
>> the effective number of nodes, probably because slurm does not like 
>> too many passwordless ssh connections).
>> 
>> I am asking because I am wondering if an effort to set up WIEN2k on a 
>> supercomputer makes any sense at all. Having a slightly faster single 
>> core might not be worth the effort.
>> 
>> Best,
>> Lukasz
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html