[Wien] speedup with mpi
Peter Blaha
peter.blaha at tuwien.ac.at
Fri Nov 15 15:06:13 CET 2024
Look at the benchmark page mentioned in my last email:
MPI-parallel benchmark: NMAT=11571, real, full diagonalization
Intel Core i9-14900K (8 cores at 6.0 GHz + 12 cores at 3.4 GHz), oneapi-2021.1.1
(wall times)
serial code, OMP=8:
mpi-benchmark.output1:   TIME HAMILT (CPU)  = 28.8, HNS = 44.5, HORB = 0.0, DIAG = 83.7
mpi-benchmark.output1:   TIME HAMILT (WALL) =  3.7, HNS =  5.6, HORB = 0.0, DIAG = 19.8
>   SUM OF WALL CLOCK TIMES:  29.4 (INIT = 0.3 + K-POINTS = 29.1)

mpi-code (ELPA), OMP=1, 8 mpi-jobs:
mpi-benchmark.output1_1: TIME HAMILT (CPU)  =  4.3, HNS =  3.3, HORB = 0.0, DIAG = 42.7
mpi-benchmark.output1_1: TIME HAMILT (WALL) =  4.3, HNS =  3.3, HORB = 0.0, DIAG = 42.7
>   SUM OF WALL CLOCK TIMES:  51.0 (INIT = 0.3 + K-POINTS = 50.6)
The crucial quantity is not the number of atoms but the matrix size
(check it with: x lapw1 -nmat_only). The two are correlated, but
depending on the type of atoms (s,p,d,f atoms), the crystal structure
(open or close packed; nn-distances --> RMTs) and the desired precision,
there can easily be a factor of 10 or more between the matrix sizes of
two cases with the same number of atoms.
Our mpi-benchmark has NMAT=11500. On a single i9-14900K the OMP version
is almost a factor of 2 faster than 8 mpi processes. Why? Check the
partial timings:
Hamilt: the difference between OMP and mpi is small, both scale well.
HNS: identical time.
DIAG: mpi is 2x slower. This is not due to bad scaling; the parallel
diagonalization algorithm is simply more than twice as slow as the
sequential one.
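As a quick illustration (plain arithmetic on the wall times quoted
above, nothing WIEN2k-specific), a minimal Python sketch comparing the
two runs:

# Illustration only: ratio of the wall-time breakdown of the 8-process mpi
# run to the OMP=8 run from the benchmark above.
omp = {"HAMILT": 3.7, "HNS": 5.6, "DIAG": 19.8, "TOTAL": 29.4}
mpi = {"HAMILT": 4.3, "HNS": 3.3, "DIAG": 42.7, "TOTAL": 51.0}

for part in ("HAMILT", "HNS", "DIAG", "TOTAL"):
    print(f"{part:6s}: mpi/OMP = {mpi[part] / omp[part]:4.2f}")

# DIAG dominates: it is roughly 2x slower with mpi, which is why the total
# wall time is about 1.7x longer although HAMILT and HNS are comparable.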
On a supercomputer you usually have: more cores/node and many more nodes.
OMP: I have no recent benchmark, but I would expect that the
OMP-parallelization on a single node does NOT scale particularly well if
you have many more cores; this may be "mkl-version" dependent (I always
said OMP=2 is good, 4 maybe acceptable, 8 useless. But with the latest
oneapi versions: OMP=4 is still good, 8 is acceptable.)
mpi: The diagonalization algorithm should parallelize much better, so I
would expect it to scale MUCH better with more cores (still memory
bound) or nodes (little degradation on a fast network).
So this NMAT=11000 case should definitely run fine on 64 cores (8 times
faster than on 8 cores); maybe even 128 or 256 cores still give some
speedup. However, there is a limit set by the ratio of matrix size to
number of cores: NMAT / sqrt(n-cores) > ~1000 (the exact number you have
to test on your hardware, and it also depends on what degradation you
still accept - I've seen people running on 128 instead of 64 cores
because it is 10% faster (instead of the desired 100%), and I've seen
people running the same case on 512 cores where it was 30% SLOWER ! (they
would still use 512 cores ...)).
This is because the large matrix gets distributed over an n x n grid of
processes, and the dimension of the local submatrices (roughly
NMAT/sqrt(n-cores)) should not go below ~1000.
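To make that rule concrete, here is a minimal sketch (my own
illustration, not a WIEN2k tool; the NMAT values other than 11500 are
just made-up examples) that turns NMAT / sqrt(n-cores) > ~1000 into a
rough upper limit on the core count:

# Rough illustration of the rule NMAT / sqrt(n_cores) > ~1000: the largest
# core count that keeps the local submatrix dimension above min_local_dim.
def max_mpi_cores(nmat, min_local_dim=1000):
    return max(1, int((nmat / min_local_dim) ** 2))

for nmat in (11500, 23000, 46000):
    print(f"NMAT = {nmat:5d}  ->  at most ~{max_mpi_cores(nmat):4d} mpi cores")

For NMAT=11500 this gives roughly 130 cores, consistent with the
64-128 core estimate above.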
So in essence: I cannot give you a definite answer to your questions.
You have to find out for yourself on your specific hardware and with
your specific case.
PS:
> My recent case has 50 atoms, and 169 k-point klist in the SCF. But
> normally I have less atoms. I am not sure if I can reduce the klist, I
> am interested in SOC calculations and band structure -- this can be
> tested of course.
Again: from just 50 atoms and 169 k-points I cannot judge this. It
depends on metal/nonmetal, the symmetry (still cubic, or tetragonal with
SOC, or P1 without any symmetry?), and whether the Fermi surface is
"nasty" or not.
The ONLY way to do this is:
init; init_so
run[sp] -so with a smaller k-mesh
save k1
x kgen (better k-mesh)
run...
If your k-mesh was converged, it should stop after 3 iterations (because
the density did not change); if not (and :DIS was big in the first
cycles) ---> continue later on with an even denser mesh.
Very often the density (potential) converges with a small number of
k-points, and only for a DOS or band structure do you use a denser mesh.
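If you want to check this without scrolling through case.scf by hand, a
small sketch like the following may help (my own helper, not part of
WIEN2k; it assumes the charge distance is the last number on each :DIS
line of case.scf):

# Hypothetical helper, not part of WIEN2k: print the charge distance (:DIS)
# of every SCF iteration, so you can see whether the denser k-mesh changed
# the density at all. Assumes the distance is the last number on each :DIS line.
import re
import sys

def dis_history(scf_file):
    values = []
    with open(scf_file) as f:
        for line in f:
            if line.startswith(":DIS"):
                numbers = re.findall(r"[-+]?\d*\.\d+(?:[eE][-+]?\d+)?", line)
                if numbers:
                    values.append(float(numbers[-1]))
    return values

if __name__ == "__main__":
    for i, d in enumerate(dis_history(sys.argv[1]), start=1):
        print(f"iteration {i:3d}:  :DIS = {d:.7f}")

Run it on case.scf after the second scf run; if all :DIS values are tiny
right from the start, the smaller k-mesh was already sufficient.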
PPS: The k-mesh should always "correlate" with other parameters of such
calculations like RKMAX, GMAX, FFT-grid, HDLOs, L-VNS, ...
I've tried to balance this with the -prec 0-3(n) option in init_lapw,
but of course this can only be an "averaged" recommendation.
>
> Assuming a single k-point: how much faster mpi on the cluster can really
> be (assuming perfect bandwidth)?
>
> Would it be e.g. 20x faster for 20 atoms slab? Does it correlate with
> number of atoms in the unit cell?
>
> Will mpi speed scale like OMP on a single node (until the RAM and inter-
> node bandwidth limit)?
>
> Is it in general true that for WIEN2k the mpi speedup is roughly the
> same like OMP, but just allows distributing jobs to various nodes (thus
> fixing the RAM bandwidth issue)?
>
> Yes, I do use SCRATCH directory on the cluster.
>
> Best,
> Lukasz
>
>
>
>
>
> On 2024-11-15 10:22, Peter Blaha wrote:
>>> Does anyone has experience in running WIEN2k on a cluster or
>>> supercomputer with both k-parallel and mpi? My interest is in band
>>> structure calculations of large slabs (e.g. >20 atoms). I would
>>> appreciate any comment/remark.
>>
>> What you call "large slab" (20 atoms), I'd call "small calculation".
>>
>> What makes your calculations expensive is the tremendous number of
>> k-points you are using in general. Do you also run the scf cycle with
>> this huge number of k-points?
>>
>> On a single node, with limited memory bandwidth, mpi will NOT help. It
>> suffers from the same memory bandwidth limit.
>> For your "small" calculations you can probably use a few nodes on a
>> supercomputer (certainly not too many) and couple them via mpi to get
>> a single k-point done a bit faster. This depends on the hardware
>> (network speed and I/O ) and again memory bandwidth. Note: ELPA is
>> mandatory for fast mpi-calculations !!
>>
>> At http://www.wien2k.at/reg_user/benchmark/ you can find my benchmarks
>> for the I9-14900K. It shows the identical limitations due to memory.
>> In fact I often use OMP=8 and only 1 k-job on such a machine (or OMP=4
>> and 2 k-parallel jobs - not much difference). Therefore my recommendation
>> for new PCs would be to use a processor with fewer cores, but maybe buy
>> more of them and couple them for k-parallel.
>>
>> The memory bandwidth problem is related to all "linear algebra" tasks,
>> i.e. the matrix diagonalization.
>>
>> For sure, a "good supercomputer" should give you overall a better
>> performance, but for such "small cases", don't expect too much. While
>> the memory bandwidth is often less problematic with Xeon type (or AMD)
>> cores, most supercomputers suffer either from network or I/O
>> limitations. And k-parallel jobs are quite I/O intensive (I hope you
>> ALWAYS use a local SCRATCH directory ?).
>> Also note: the single core performance is usually SLOWER than what you
>> can get on an I9-14900K PC.
>>
>> The real benefit of a supercomputer + mpi is its "unlimited" memory
>> (and sometimes, that it does not cost you any real money). You can do
>> unit cells with several hundreds of atoms ....
>>
>> The passwordless ssh should not be a problem on a reasonable slurm
>> machine - the problem is the slow network leading to timeouts in ssh
>> connections....
>>
>>> - I am wondering if there is a realistic speedup when using mpi? Can
>>> I have e.g. 10x speedup only from mpi, compared to single core? Will
>>> speedup then multiply with k-parallel?
>>>
>>> - Does mpi on a single node also suffer from the memory bandwidth
>>> (related to the number of memory channels on the chipset/mainboard)?
>>>
>>> - Has anyone been able to find a workaround for the passwordless ssh
>>> for running on a cluster/supercomputer?
>>>
>>> - Is the memory bandwidth problem intrinsic to LAPW, or specific to
>>> WIEN2k? With two memory channels on desktop machines, the k-parallel
>>> speedup is only up to something like 4 cores (depending on OMP a
>>> bit). Actually WIEN2k speed does increase quite a lot with the RAM
>>> speed (e.g. DDR5 7200) -- this is the case on i9-14900 that I am
>>> using in the office.
>>>
>>> For few years I have been using an older slurm cluster, and jobs
>>> typically crash when using more than 8 or 10 nodes with k-parallel
>>> (also with small OMP, I think this has been discussed in the mailing
>>> list). In my case each node has 8 cores and 4 memory channels, so it
>>> can do 8 k- parallel jobs with practically linear speedup. I would
>>> say that in general extra speedup with OMP is not significant, and
>>> the real speedup is only with k-parallel (but as mentioned it is
>>> limited by the effective number of nodes, probably because slurm does
>>> not like too many passwordless ssh connections).
>>>
>>> I am asking because I am wondering if an effort to setup WIEN2k on a
>>> supercomputer makes any sense at all. Having a bit faster single core
>>> might not be worth the effort.
>>>
>>> Best,
>>> Lukasz
--
-----------------------------------------------------------------------
Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.blaha at tuwien.ac.at
WWW: http://www.imc.tuwien.ac.at WIEN2k: http://www.wien2k.at
-------------------------------------------------------------------------