[Wien] speedup with mpi
Peter Blaha
peter.blaha at tuwien.ac.at
Fri Nov 15 15:06:13 CET 2024
Look at the benchmark page mentioned in my last email:
MPI-parallel benchmark: NMAT=11571, real, full diagonalization
Intel Core i9-14900K (8 cores at 6.0 GHz + 12 cores at 3.4 GHz), oneapi-2021.1.1
(wall times)
serial code, OMP=8:
mpi-benchmark.output1:   TIME HAMILT (CPU)  = 28.8, HNS = 44.5, HORB = 0.0, DIAG = 83.7
mpi-benchmark.output1:   TIME HAMILT (WALL) =  3.7, HNS =  5.6, HORB = 0.0, DIAG = 19.8
>   SUM OF WALL CLOCK TIMES:  29.4 (INIT = 0.3 + K-POINTS = 29.1)

mpi-code (ELPA), OMP=1, 8 mpi-jobs:
mpi-benchmark.output1_1: TIME HAMILT (CPU)  =  4.3, HNS =  3.3, HORB = 0.0, DIAG = 42.7
mpi-benchmark.output1_1: TIME HAMILT (WALL) =  4.3, HNS =  3.3, HORB = 0.0, DIAG = 42.7
>   SUM OF WALL CLOCK TIMES:  51.0 (INIT = 0.3 + K-POINTS = 50.6)
The crucial quantity is not the number of atoms but the matrix size
(check it with: x lapw1 -nmat_only). The two are correlated, but
depending on the type of atoms (s,p,d,f atoms), the crystal structure
(open or close packed; nn-distances --> RMTs) and the desired precision,
there can easily be a factor of 10 or more between the matrix sizes of
two cases with the same number of atoms.
Our mpi-benchmark has NMAT=11500. On a single i9-14900K the OMP version
is almost a factor of 2 faster than 8 mpi processes. Why? Check the
partial timings:
Hamilt: the difference between OMP and mpi is small, both scale well.
HNS: identical time.
DIAG: mpi is 2x slower. This is not due to bad scaling; the parallel
diagonalization algorithm is simply more than twice as slow as the
sequential one.
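As a quick illustration (plain arithmetic on the wall times quoted
above, nothing WIEN2k-specific), a minimal Python sketch comparing the
two runs:

# Illustration only: ratio of the wall-time breakdown of the 8-process mpi
# run to the OMP=8 run from the benchmark above.
omp = {"HAMILT": 3.7, "HNS": 5.6, "DIAG": 19.8, "TOTAL": 29.4}
mpi = {"HAMILT": 4.3, "HNS": 3.3, "DIAG": 42.7, "TOTAL": 51.0}

for part in ("HAMILT", "HNS", "DIAG", "TOTAL"):
    print(f"{part:6s}: mpi/OMP = {mpi[part] / omp[part]:4.2f}")

# DIAG dominates: it is roughly 2x slower with mpi, which is why the total
# wall time is about 1.7x longer although HAMILT and HNS are comparable.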
On a supercomputer you usually have: more cores/node and many more nodes.
OMP: I have no recent benchmark, but I would expect that the
OMP-parallelization on a single node does NOT scale particularly well if
you have many more cores; this may be "mkl-version" dependent (I always
said OMP=2 is good, 4 maybe acceptable, 8 useless. But with the latest
oneapi versions: OMP=4 is still good, 8 is acceptable.)
mpi: The diagonalization algorithm should parallelize much better, so I
would expect it to scale MUCH better with more cores (still memory
bound) or nodes (little degradation on a fast network).
So this NMAT=11000 case should definitely run fine on 64 cores (8 times
faster than on 8 cores); maybe even 128 or 256 cores still give some
speedup. However, there is a limit set by the ratio of matrix size to
number of cores: NMAT / sqrt(n-cores) > ~1000 (the exact number you have
to test on your hardware, and it also depends on what degradation you
still accept - I've seen people running on 128 instead of 64 cores
because it is 10% faster (instead of the desired 100%), and I've seen
people running the same case on 512 cores where it was 30% SLOWER ! (they
would still use 512 cores ...)).
This is because the large matrix gets distributed over an n x n grid of
processes, and the dimension of the local submatrices (roughly
NMAT/sqrt(n-cores)) should not go below ~1000.
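To make that rule concrete, here is a minimal sketch (my own
illustration, not a WIEN2k tool; the NMAT values other than 11500 are
just made-up examples) that turns NMAT / sqrt(n-cores) > ~1000 into a
rough upper limit on the core count:

# Rough illustration of the rule NMAT / sqrt(n_cores) > ~1000: the largest
# core count that keeps the local submatrix dimension above min_local_dim.
def max_mpi_cores(nmat, min_local_dim=1000):
    return max(1, int((nmat / min_local_dim) ** 2))

for nmat in (11500, 23000, 46000):
    print(f"NMAT = {nmat:5d}  ->  at most ~{max_mpi_cores(nmat):4d} mpi cores")

For NMAT=11500 this gives roughly 130 cores, consistent with the
64-128 core estimate above.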
So in essence: I cannot give you a definite answer to your questions.
You have to find out for yourself on your specific hardware and with
your specific case.
PS:
> My recent case has 50 atoms, and 169 k-point klist in the SCF. But
> normally I have less atoms. I am not sure if I can reduce the klist, I
> am interested in SOC calculations and band structure -- this can be
> tested of course.
Again: from just 50 atoms and 169 k-points I cannot judge this. It
depends on metal/nonmetal, the symmetry (still cubic, or tetragonal with
SOC, or P1 without any symmetry?), and whether the Fermi surface is
"nasty" or not.
The ONLY way to do this is:
init; init_so
run[sp] -so with a smaller k-mesh
save k1
x kgen (better k-mesh)
run...
If your k-mesh was converged, it should stop after 3 iterations (because
the density did not change); if not (and :DIS was big in the first
cycles) ---> continue later on with an even denser mesh.
Very often the density (potential) converges with a small number of
k-points, and only for a DOS or band structure do you use a denser mesh.
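If you want to check this without scrolling through case.scf by hand, a
small sketch like the following may help (my own helper, not part of
WIEN2k; it assumes the charge distance is the last number on each :DIS
line of case.scf):

# Hypothetical helper, not part of WIEN2k: print the charge distance (:DIS)
# of every SCF iteration, so you can see whether the denser k-mesh changed
# the density at all. Assumes the distance is the last number on each :DIS line.
import re
import sys

def dis_history(scf_file):
    values = []
    with open(scf_file) as f:
        for line in f:
            if line.startswith(":DIS"):
                numbers = re.findall(r"[-+]?\d*\.\d+(?:[eE][-+]?\d+)?", line)
                if numbers:
                    values.append(float(numbers[-1]))
    return values

if __name__ == "__main__":
    for i, d in enumerate(dis_history(sys.argv[1]), start=1):
        print(f"iteration {i:3d}:  :DIS = {d:.7f}")

Run it on case.scf after the second scf run; if all :DIS values are tiny
right from the start, the smaller k-mesh was already sufficient.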
PPS: The k-mesh should always "correlate" with other parameters of such
calculations like RKMAX, GMAX, FFT-grid, HDLOs, L-VNS, ...
I've tried to balance this with the -prec 0-3(n) option in init_lapw,
but of course this can only be an "averaged" recommendation.
>
> Assuming a single k-point: how much faster mpi on the cluster can really
> be (assuming perfect bandwidth)?
>
> Would it be e.g. 20x faster for 20 atoms slab? Does it correlate with
> number of atoms in the unit cell?
>
> Will mpi speed scale like OMP on a single node (until the RAM and inter-
> node bandwidth limit)?
>
> Is it in general true that for WIEN2k the mpi speedup is roughly the
> same like OMP, but just allows distributing jobs to various nodes (thus
> fixing the RAM bandwidth issue)?
>
> Yes, I do use SCRATCH directory on the cluster.
>
> Best,
> Lukasz
>
>
>
>
>
> On 2024-11-15 10:22, Peter Blaha wrote:
>>> Does anyone has experience in running WIEN2k on a cluster or
>>> supercomputer with both k-parallel and mpi? My interest is in band
>>> structure calculations of large slabs (e.g. >20 atoms). I would
>>> appreciate any comment/remark.
>>
>> What you call "large slab" (20 atoms), I'd call "small calculation".
>>
>> What makes your calculations expensive is the tremendous number of
>> k-points you are using in general. Do you also run the scf cycle with
>> this huge number of k-points?
>>
>> On a single node, with limited memory bandwidth, mpi will NOT help. It
>> suffers from the same memory bandwidth limit.
>> For your "small" calculations you can probably use a few nodes on a
>> supercomputer (certainly not too many) and couple them via mpi to get
>> a single k-point done a bit faster. This depends on the hardware
>> (network speed and I/O ) and again memory bandwidth. Note: ELPA is
>> mandatory for fast mpi-calculations !!
>>
>> At http://www.wien2k.at/reg_user/benchmark/ you can find my benchmarks
>> for the I9-14900K. It shows the identical limitations due to memory.
>> In fact I often use OMP=8 and only 1 k-job on such a machine (or OMP=4
>> and 2 k-parallel jobs - not much difference). Therefore my recommendation
>> for new PCs would be to use a processor with fewer cores, but maybe buy
>> more of them and couple them for k-parallel.
>>
>> The memory bandwidth problem is related to all "linear algebra" tasks,
>> i.e. the matrix diagonalization.
>>
>> For sure, a "good supercomputer" should give you overall a better
>> performance, but for such "small cases", don't expect too much. While
>> the memory bandwidth is often less problematic with Xeon type (or AMD)
>> cores, most supercomputers suffer either from network or I/O
>> limitations. And k-parallel jobs are quite I/O intensive (I hope you
>> ALWAYS use a local SCRATCH directory ?).
>> Also note: the single core performance is usually SLOWER than what you
>> can get on an I9-14900K PC.
>>
>> The real benefit of a supercomputer + mpi is its "unlimited" memory
>> (and sometimes, that it does not cost you any real money). You can do
>> unit cells with several hundreds of atoms ....
>>
>> The passwordless ssh should not be a problem on a reasonable slurm
>> machine - the problem is the slow network leading to timeouts in ssh
>> connections....
>>
>>> - I am wondering if there is a realistic speedup when using mpi? Can
>>> I have e.g. 10x speedup only from mpi, compared to single core? Will
>>> speedup then multiply with k-parallel?
>>>
>>> - Does mpi on a single node also suffer from the memory bandwidth
>>> (related to the number of memory channels on the chipset/mainboard)?
>>>
>>> - Has anyone been able to find a workaround for the passwordless ssh
>>> for running on a cluster/supercomputer?
>>>
>>> - Is the memory bandwidth problem intrinsic to LAPW, or specific to
>>> WIEN2k? With two memory channels on desktop machines, the k-parallel
>>> speedup is only up to something like 4 cores (depending on OMP a
>>> bit). Actually WIEN2k speed does increase quite a lot with the RAM
>>> speed (e.g. DDR5 7200) -- this is the case on i9-14900 that I am
>>> using in the office.
>>>
>>> For few years I have been using an older slurm cluster, and jobs
>>> typically crash when using more than 8 or 10 nodes with k-parallel
>>> (also with small OMP, I think this has been discussed in the mailing
>>> list). In my case each node has 8 cores and 4 memory channels, so it
>>> can do 8 k- parallel jobs with practically linear speedup. I would
>>> say that in general extra speedup with OMP is not significant, and
>>> the real speedup is only with k-parallel (but as mentioned it is
>>> limited by the effective number of nodes, probably because slurm does
>>> not like too many passwordless ssh connections).
>>>
>>> I am asking because I am wondering if an effort to setup WIEN2k on a
>>> supercomputer makes any sense at all. Having a bit faster single core
>>> might not be worth the effort.
>>>
>>> Best,
>>> Lukasz
--
-----------------------------------------------------------------------
Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.blaha at tuwien.ac.at
WWW: http://www.imc.tuwien.ac.at WIEN2k: http://www.wien2k.at
-------------------------------------------------------------------------